so that the reconstruction task is formulated as an estimation problem.
Our  main  contributions  are  the  following: 

1. We introduce the use of specific error criteria for the design of
the optimal Bayesian estimators for several classes of problems,
and propose a general (Monte Carlo) procedure for approximating them.
This new approach leads to a substantial improvement over the existing
schemes, both regarding the quality of the results (particularly for
low signal to noise ratios) and the computational efficiency.

2. We apply the Bayesian approach to the solution of several problems,
some of which are formulated and solved in these terms for the first
time. Specifically, these applications are: the reconstruction of
piecewise constant functions from noisy data; the reconstruction of
piecewise continuous surfaces from sparse and noisy observations; the
reconstruction of depth from stereoscopic pairs of images; and the
formation of perceptual clusters.

3. For each one of these applications, we develop fast, deterministic
algorithms  that  approximate  the  optimal  estimators,  and  illustrate 
their  performance  on  both  synthetic  and  real  data. 

4. We propose a new method, based on the analysis of the residual process,
for estimating the parameters of the probabilistic models directly
from noisy observations. This scheme leads to an algorithm, which has
no free parameters, for the restoration of piecewise uniform images.

5. We analyze the implementation of the algorithms that we develop
in nonconventional hardware, such as massively parallel digital machines,
and analog and hybrid networks.
Chapter 1


INTRODUCTION 


A  fundamental  problem  in  the  design  and  analysis  of  systems  endowed  with 
perceptual  abilities  is  the  construction  of  internal  representations  of  the  physical 
structures  in  the  external  world.  The  precise  form  of  these  representations  is  not  well 
understood,  and  is  the  subject  of  much  current  research  in  Artificial  Intelligence 
and  Psychology.  It  is  clear,  however,  that  these  representations  should  integrate 
prior  generic  knowledge  about  the  physical  properties  of  the  external  world  with 
measurements  from  a  number  of  different  sensory  modalities.  Furthermore,  in 
order  to  be  effectively  action-oriented,  the  representations  should  provide  compact 
descriptions  of  the  physical  structures  of  interest  at  different  levels  of  detail. 

This problem is not exclusive to biological perceptual systems; it arises
whenever information from a set of sensors has to be processed, stored and retrieved
in an efficient way. Thus, it is of fundamental importance, for example, in the
design of computer vision systems, in the reconstruction of subterranean geological
structures from geophysical data, and in the design of biomedical imaging systems.
The  motivation  for  this  thesis  is  to  increase  our  understanding  of  the  principles 
underlying  the  process  of  integrating  prior  generic  constraints  with  the  available 
observations,  for  the  construction  of  these  representations.  In  particular,  we  will 
address  the  problem  of  reconstructing,  in  a  way  that  is  consistent  with  the  available 
sensory  data,  the  value  of  certain  properties  of  the  physical  structure  of  interest  over 
a  discretized  region  of  space. 

To define these early perceptual processes in a more precise way, let us model
the specific properties of the physical structure as functions $f$ that map a (compact)
region $\Omega \subset \mathbb{R}^n$ into $\mathbb{R}^m$. In the most interesting cases, $f$ will be either a scalar
($m = 1$) or a vector field ($m = 2$) defined on a two-dimensional region. This is
the case, for example, of the problems of image restoration and segmentation, and
of the recovery of depth from stereo, lightness, and shape from shading, and the
computation of optical flow in computer vision, as well as many problems in the
recovery of geological structure from geophysical measurements.

We will assume that the available data consists of several sets of qualitatively
different measurements $\{g_1, \ldots, g_M\}$ that in general are modeled as:

$$g_i = H_i(f, Df, D^2 f, \ldots, n_i)$$

where $Df$ denotes the derivative of the property $f$; $n_i$ is a noise process, and $H_i$
is some operator (for example, in vision problems, the different measurements may
correspond to stereo disparity, brightness, color, etc.). We will also assume that
this information is collected with different sampling patterns $\{S_1, \ldots, S_M\}$, that is,
the observations $g_i$ are defined only on the finite set $S_i \subset \Omega$. Since most physical
phenomena  consist  of  events  that  occur  at  a  variety  of  scales,  and  in  general, 
events  at  widely  different  scales  have  little  influence  on  one  another,  the  numerical 
descriptions  of  the  behavior  of  a  property  over  a  range  of  scales  can  be  used 
effectively  to  produce  a  physically  meaningful  hierarchical  decomposition  of  the 
original  structure  into  individual  substructures  ("objects")  which  can  be  subsequently 
described  in  symbolic  forms  that  are  more  compact  and  easy  to  manipulate  (see 
Marr, 1976 and 1982; it is not surprising that there is psychophysical evidence
suggesting  the  presence  of  a  multiscale  processing  hierarchy  in  the  human  visual 
system;  see  Campbell  and  Robson,  1977,  and  Marroquin,  1976). 

Thus, the solutions we are looking for consist of a family $\{f_\alpha\}$ of numerical
descriptions of the function $f$ at different scales (indexed by $\alpha$) at the sites of some
lattice $L \subset \Omega$ (the finest scale representation should correspond to the best estimate
of the actual value of $f$ at the sites of $L$). To illustrate this idea, in figure 1-a we
present  a  binary  pattern,  and  in  figures  1-b  through  1-e,  its  numerical  representation 
at  increasingly  coarser  scales.  This  family  of  descriptions  was  generated  by  the 
algorithm  described  in  section  5  of  chapter  4. 

In general, the observation processes $g_i$ do not determine the value of $f$ in a
unique and stable way (that is to say, these problems are ill-posed in the sense of
Hadamard; see Poggio and Torre, 1984). Therefore, the algorithms we are looking
for  should  be  able  to  regularize  the  problem  by  incorporating  constraints  on  the 
solution  generated  by  some  prior  knowledge  about  its  general  characteristics. 

Finally,  because  of  the  large  number  of  variables  involved,  reasonable  speed 
of  performance  will  usually  require  that  these  algorithms  be  distributed,  and  thus, 
efficiently  implementable  in  parallel  hardware. 

1.  Regularization  Analysis  and  Cooperative  Algorithms. 

Among the most successful solutions to this type of problem are those
that  formulate  them  as  variational  problems,  where  the  measurement  and  generic 
constraints  are  separately  represented  in  the  following  way: 

Let us consider the case of only one set of "perfect" measurements (i.e., with
no noise) $g$ defined on the set $S$, and suppose that the constraints that they impose
on the solution can be expressed in the form:


$$A(f, g) = 0$$


where $A$ is a positive definite, real valued function that measures the incompatibility
of the value of the property $f$ with the observations $g$. In general, the observations
will not be perfect, and so we will only require that the error $\int_S A(f, g)$ be small.
However, there may be a large number of configurations that minimize the error.
To find a unique solution, an assumption about the global smoothness of $f$ is
introduced by means of some positive definite, real valued function $P(f, Df, \ldots)$
which measures the "jaggedness" of $f$. If both $A$ and $P$ are convex, the desired
solution will be the unique minimizer of the "energy" functional:

$$U(f, g) = \int_S A(f, g) + \lambda \int_\Omega P(f, Df, \ldots) \qquad (1)$$

where $\lambda$ is a parameter.

This approach has been applied with varying degrees of success to the
problems of surface interpolation (Grimson, 1981b, 1982; Terzopoulos, 1983,
1984a); computation of visual motion (Horn and Schunck, 1981; Hildreth, 1984a,b);
recovery of shape from shading information (Ikeuchi and Horn, 1981); computation
of subjective contours (Ullman, 1976; Brady et al., 1980; Horn, 1981); lightness
(Horn, 1974), and edge detection (Torre and Poggio, 1983).

In  a  recent  paper,  Poggio  and  Torre  (1984)  have  shown  how  functionals  of 
the  form  of  equation  (1)  can  be  derived  in  a  rigorous  and  systematic  way  using 
regularization methods (Tikhonov, 1963; Tikhonov and Arsenin, 1977; Wahba,
1980; in this context $\int_\Omega P$ is called a stabilizing functional, and $\lambda$ the regularization
parameter).

Once  the  functional  (1)  is  specified,  its  minimization  can  be  carried  out  by 
standard  variational  methods  (Courant  and  Hilbert,  1953).  Since  usually  one  is 
interested in the value of $f$ only at the discrete set of points $L$, the solution of the
resulting Euler-Lagrange partial differential equations can be obtained as the fixed
point  of  a  relaxation  (cooperative)  algorithm  of  the  form: 

$$f_i^{(k+1)} = \Phi_i(f^{(k)}), \qquad i \in L \qquad (2)$$

This  algorithm  can  be  efficiently  implemented  in  parallel  hardware  using  a 
network of locally connected processors (one for each site $i$), or even by some
analog  network  (see  Poggio  and  Koch,  1984). 
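
A relaxation scheme of the form (2) can be illustrated with a minimal sketch. It assumes a one-dimensional lattice, a quadratic data term $(f_i - g_i)^2$ on the observed sites, and a quadratic first-difference stabilizer; these choices, the function name `relax`, and its parameters are illustrative assumptions and are not taken from the text. Each site is updated from its two neighbors only, which is what makes the scheme local.

```python
def relax(g, lam=1.0, iters=500):
    """Jacobi relaxation for the (assumed) discrete energy
        U(f) = sum_{i observed} (f_i - g_i)^2 + lam * sum_i (f_{i+1} - f_i)^2.

    g   : list holding a number at observed sites and None elsewhere.
    lam : regularization parameter (assumed > 0).
    """
    n = len(g)
    observed = [v for v in g if v is not None]
    if not observed:
        raise ValueError("at least one observation is required")
    # Start from a constant field equal to the mean of the observations.
    f = [sum(observed) / len(observed)] * n
    for _ in range(iters):
        new_f = []
        for i in range(n):
            # Values of the nearest neighbors at the previous iteration.
            nbrs = [f[j] for j in (i - 1, i + 1) if 0 <= j < n]
            d = 0.0 if g[i] is None else 1.0       # 1 if site i is observed
            data = 0.0 if g[i] is None else g[i]
            # Setting dU/df_i = 0 and solving for f_i gives the local update.
            new_f.append((d * data + lam * sum(nbrs)) / (d + lam * len(nbrs)))
        f = new_f
    return f
```

For example, `relax([0.0, None, None, None, 4.0])` fills the three unobserved sites with values that vary linearly between the two smoothed end points.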
It is interesting to note that it is also possible, and sometimes easier, to
embed  the  prior  knowledge  about  the  solution,  and  the  constraints  imposed  by  the 
observations,  directly  in  a  cooperative  network  of  a  given  form,  without  explicitly 
defining  a  global  variational  principle.  This  approach  has  been  used  by  Marr  and 
Poggio  (1976)  for  the  stereo  matching  problem.  We  will  have  more  to  say  about  it 
in  chapter  6. 

It is also possible, in principle, to incorporate qualitatively different measurements
into a single cooperative process, by a simple modification of the energy
functional: 

Suppose that we have $M$ sets of measurements, and that each set $g_i$ places some
constraints on $f$ (and/or its derivatives) which can be expressed by the functionals:

$$\int_{S_i} A_i(f, g_i), \qquad i = 1, \ldots, M$$

The solution will now be constructed as the global minimizer of the functional:

$$U(f) = \sum_i \alpha_i \int_{S_i} A_i(f, g_i) + \lambda \int_\Omega P(f, Df, \ldots) \qquad (3)$$

where the parameters $\alpha_i$ measure the relative weight we wish to assign to each set
of measurements.

If all the functions $A_i$ are convex, the solution will again be unique, and
the minimization of (3) may be carried out by means of a cooperative network
(this approach has been used by Terzopoulos (1985) for the surface interpolation
problem, when the depth value $f$ is known at some set $S_1$ of sites, and the slope
($Df$) at a different set $S_2$).

The approach we have been discussing, which we will call the "standard
regularization method", is very attractive: it provides a unified framework for the
formulation  of  a  variety  of  problems,  and  it  leads  to  computationally  efficient 
algorithms.  However,  it  has  some  important  limitations  (some  of  them  pointed  out 
by  Poggio  and  Torre): 

(i) Very often the assumption that the solution $f$ is smooth over the whole
domain $\Omega$ is not justified. What is more commonly true is that $\Omega$ can be
partitioned into a small set of disjoint connected regions, and that while $f$
is smooth in the interior of each of them, it has discontinuities along the
boundaries between regions (which in turn are piecewise smooth curves).
This limitation is a serious one, because very often the discontinuities of
$f$, which the regularization methods tend to hide, are the most important
parts of the surface, in particular if one is trying to compute a symbolic
representation for it.

(ii)  The  meaning  of  the  parameters  of  the  energy  functional  is  not  always 
clear,  and  they  often  have  to  be  selected  on  a  purely  empirical  basis. 

(iii)  In  many  cases,  the  choice  of  the  particular  (often  quadratic)  form  for  the 
functions  A  and  P  is  arbitrary,  and  is  determined  mainly  by  the  tractability 
of  the  uniqueness  problem  for  the  solution,  and  by  the  simplicity  of  the 
(linear)  minimization  algorithm  (in  some  cases,  of  course,  there  may  be 
other  theoretical  or  experimental  considerations  that  justify  this  choice). 

(iv)  The  interaction  between  qualitatively  different  observations  is  purely 
additive.  One  would  like  to  be  able  to  include  more  realistic  non-linear 
modes  of  interaction. 


2.  Probabilistic  Formulation. 


A different approach is to model the function $f$, whose reconstruction solves
a perceptual problem, as a random field that has to be estimated from a set of
noisy, and possibly ambiguous measurements. Within this formulation, one can
adopt a Bayesian viewpoint (see Good, 1983), and assume that the best way of
expressing the prior knowledge about the nature of the solution is in the form of a
(prior) probability distribution $P_f$. This distribution, together with a probabilistic
description of the noise that corrupts the observations, allows one to use Bayes
theory to compute the posterior distribution $P_{f|g}$, which represents the likelihood of
a solution $f$ given the observations $g$. In this way, we can solve the reconstruction
problem by finding the estimate $\hat{f}$ which either maximizes this likelihood (the
so-called Maximum a Posteriori or MAP estimate), or minimizes the expected value
(with respect to $P_{f|g}$) of an appropriate error function. This formulation has several
advantages over the "Standard Regularization" approach:


1.  Flexibility. 


With simple modifications in the prior probabilistic model for $f$, one can
generate algorithms that reconstruct not only smooth, but piecewise constant or
piecewise continuous functions. It is also possible to include explicitly into the
model prior knowledge about the geometry of the curves that bound the smooth
patches (i.e., about the discontinuities) of $f$.

2.  Generality. 

This  approach  provides  a  general  framework  for  the  formulation  of  a  wide 
variety  of  perceptual  problems.  We  will  show,  for  instance,  how  it  can  be  used 
for:  image  segmentation;  surface  reconstruction  from  sparse  data;  modeling  of 
perceptual  grouping  processes;  stereo  matching,  etc.  Furthermore,  the  incorporation 
of qualitatively different measurements into a single cooperative estimation process
can be made in a natural way: if the noise processes $n_1, \ldots, n_M$ associated with
the sets of measurements $g_1, \ldots, g_M$ are independent, the joint posterior distribution
$P(f \mid g_1, \ldots, g_M)$ will be simply:

$$P(f \mid g_1, \ldots, g_M) = \frac{P_f(f) \prod_{i=1}^{M} P(g_i \mid f)}{P(g_1, \ldots, g_M)}$$


3.  Precise  Interpretation. 

The  parameters  that  appear  in  the  reconstruction  algorithms  that  are  derived 
using  this  approach  have  a  precise  statistical  interpretation  (for  example,  the  relative 
weight  of  the  evidence  provided  by  each  set  of  observations,  will  be  determined 
by  the  variance  of  the  associated  noise  process);  also,  the  plausibility  of  the 
prior  assumptions  about  the  behavior  of  the  solution  can  be  explicitly  verified 
by generating sample functions of the random field defined by $P_f$, by means of
an  appropriate  Monte  Carlo  procedure.  Finally,  one  can  choose  the  precise  loss 
function  whose  expected  value  will  be  minimized  by  the  Bayesian  estimator. 
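
The statement that the weight of each set of observations is determined by the variance of its noise can be made concrete in the simplest possible setting. The sketch below assumes a single scalar unknown $f$, a flat prior, and independent additive Gaussian noise, $g_i = f + n_i$; these assumptions and the function name `fuse` are illustrative and not taken from the text. Under them the posterior is Gaussian, and its mean weights each measurement by the inverse of its noise variance.

```python
def fuse(measurements, variances):
    """Posterior mean and variance of a scalar f under a flat prior, given
    independent Gaussian measurements g_i = f + n_i with Var(n_i) = variances[i].
    """
    if not measurements or len(measurements) != len(variances):
        raise ValueError("need one positive variance per measurement")
    # Each measurement contributes with weight 1 / variance (its precision).
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    mean = sum(p * g for p, g in zip(precisions, measurements)) / total
    return mean, 1.0 / total
```

With measurements 0 and 10 and noise variances 1 and 9, the fused estimate is 1: the noisier measurement receives one ninth of the weight of the more reliable one.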

4. Computational Efficiency.

As we will see, if the random field defined by $P_f$ is Markovian (i.e., if the
probabilistic  dependencies  are  local),  the  estimation  algorithms  will  be  distributed, 
so  that  it  will  be  possible  to  implement  them  efficiently  in  parallel  hardware. 


3.  Goals  of  this  Thesis. 


The  objective  of  this  work  is  to  apply  the  probabilistic  approach  we  have  just 
described  to  the  solution  of  a  general  class  of  perceptual  problems.  In  particular, 
we  will: 

1.  Present  a  class  of  random  fields  with  local  probabilistic  dependencies,  that  can 
be  used  very  effectively  to  model  the  behavior  of  a  wide  variety  of  functions. 

2.  Develop  appropriate  loss  functions,  and  the  corresponding  optimal  estimators  for 
different  classes  of  problems. 

3.  Develop  general  distributed  algorithms  for  computing  these  estimates. 

4.  Apply  the  above  results  to  several  specific  problems,  to  illustrate  the  generality 
and  practical  value  of  this  approach. 

5.  Develop  more  efficient  algorithms  for  each  of  these  particular  cases. 

We  now  present  a  list  of  our  main  contributions: 

3.1.  Summary  of  our  Main  Contributions. 

1.  Optimal  Bayesian  Estimators. 

Several  researchers  have  used  Bayes  theory  and  Markov  random  field  (MRF) 
models  for  the  restoration  of  piecewise  uniform  images.  It  has  been  implicitly 
assumed  by  most  of  them  that  the  maximization  of  the  posterior  probability 
(which  leads  to  the  Maximum  a  Posteriori  or  MAP  estimator)  is  the  best  possible 
performance  criterion.  We  introduce  the  use  of  different  specific  error  criteria 
for  the  design  of  the  optimal  Bayesian  estimators  for  several  classes  of  problems, 
and  propose  a  general  procedure  (which  is  based  on  some  existing  Monte  Carlo 
techniques,  such  as  the  Metropolis  algorithm)  for  approximating  them.  We  show, 
both  theoretically  and  experimentally  (in  particular  for  the  case  of  the  restoration  of 
piecewise uniform images) that this new approach leads to a substantial improvement
over the existing methods, both regarding the quality of the results (particularly for
low signal to noise ratios) and the computational efficiency.
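
The difference between maximizing the posterior probability and minimizing the expected value of a specific error criterion can be seen on a field small enough to enumerate. The sketch below is an illustration under stated assumptions, not the procedure developed in this thesis: it assumes a short binary chain, a prior proportional to exp(beta times the number of agreeing neighbor pairs), observations flipped independently with probability `eps`, and the number of misclassified sites as the error criterion. All function names are hypothetical. For that criterion the optimal estimate takes, at each site, the value with the larger posterior marginal.

```python
from itertools import product
from math import exp

def posterior(g, beta, eps):
    """Posterior over all configurations f in {0,1}^n, by exhaustive enumeration."""
    n = len(g)
    weights = {}
    for f in product((0, 1), repeat=n):
        agree = sum(1 for i in range(n - 1) if f[i] == f[i + 1])
        prior = exp(beta * agree)                       # unnormalized prior
        flips = sum(1 for fi, gi in zip(f, g) if fi != gi)
        like = (eps ** flips) * ((1 - eps) ** (n - flips))
        weights[f] = prior * like
    z = sum(weights.values())
    return {f: w / z for f, w in weights.items()}

def map_estimate(post):
    """Configuration with the largest posterior probability."""
    return max(post, key=post.get)

def marginal_estimate(post):
    """Site-by-site maximizer of the posterior marginals."""
    n = len(next(iter(post)))
    est = []
    for i in range(n):
        p1 = sum(p for f, p in post.items() if f[i] == 1)
        est.append(1 if p1 > 0.5 else 0)
    return tuple(est)

def expected_errors(estimate, post):
    """Posterior expected number of sites at which the estimate is wrong."""
    return sum(p * sum(a != b for a, b in zip(estimate, f))
               for f, p in post.items())
```

By construction, `expected_errors(marginal_estimate(post), post)` is never larger than `expected_errors(map_estimate(post), post)`; the two estimates may or may not coincide for a given observation.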

2.  Novel  Applications. 

Throughout  this  thesis  we  present  several  examples  of  the  application  of  the 
probabilistic  approach,  and  of  the  optimal  estimation  procedures  that  we  have 
derived,  to  several  problems,  some  of  which  are  formulated  and  solved  in  these 
terms  for  the  first  time.  The  results  that  we  get  show  that  this  approach  can  provide 
a  unified  framework  for  the  integration  of  a  variety  of  related  perceptual  tasks  into  a 
single  cooperative  process.  Also,  these  results  represent,  in  several  cases,  a  significant 
improvement  over  those  obtained  using  existing  schemes.  Specifically,  these  new 
applications  are  the  following: 

a)  Surface  Interpolation. 

The  problem  of  reconstructing  a  piecewise  continuous  surface  from  sparse  and 
noisy data is formulated using a Bayesian approach, with two coupled MRF’s
to  model  the  behavior  of  the  smooth  patches,  and  of  the  curves  (discontinuities) 
that  bound  them.  Although  this  type  of  coupled  model  has  been  used  before 
(in  the  context  of  the  restoration  of  piecewise  uniform,  noisy  images),  its 
adaptation  to  this  problem  requires  some  non-trivial  modifications:  the  local 
interactions  between  the  elements  of  the  fields  have  to  be  redefined  in  an 
appropriate  way,  and  the  general  estimation  algorithm  has  to  be  modified  to 
make  it  computationally  feasible.  The  practical  value  of  the  resulting  algorithm 
is  illustrated  using  both  synthetic  and  real  data. 

b)  Signal  Matching. 

This  problem  consists  in  finding  the  corresponding  points  in  two  signals  when 
one  is  obtained  from  the  other  by  shifting  it  by  a  variable  amount.  We  study  in 
detail  a  specific  instance:  the  reconstruction  of  depth  from  a  stereoscopic  pair 
of  images,  and  show  how  to  formulate  it  using  our  general  framework.  The 
performance  of  the  algorithms  that  we  construct  is  also  illustrated  by  means  of 
synthetic  and  real  examples. 

c)  Formation  of  Perceptual  Clusters. 


We  suggest  that  the  process  of  formation  of  perceptual  clusters  of  certain  dot 
patterns  can  be  modeled  in  terms  of  the  estimation  of  binary  images  corrupted  by 
multiplicative  noise,  and  illustrate  the  application  of  our  estimation  algorithms 
to  this  task. 

3.  Efficient  Algorithms. 

Although the Monte Carlo procedure that we have developed for approximating
the optimal estimates is perfectly general, for each particular application it is often
possible to design alternative (sometimes deterministic) algorithms that significantly
improve the computational efficiency. It has been our concern in this work to
develop such alternative fast algorithms for each one of the applications that we
present. Specifically, we have developed the following algorithms:

a)  Estimation  of  One-Dimensional  Signals. 

We  present  a  new  deterministic  algorithm  of  minimal  complexity  which 
computes  (exactly)  the  MAP  estimate  of  binary,  one-dimensional  MRFs,  and 
a  rigorous  proof  of  its  optimal  performance.  We  also  develop  an  alternative 
scheme  for  the  same  purpose,  based  on  dynamic  programming  principles, 
which  can  be  extended  to  handle  more  general  situations  (such  as  the  MAP 
estimation  of  piecewise  constant  one-dimensional  signals). 
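
To give a sense of how dynamic programming applies to a one-dimensional field, here is a generic Viterbi-style sketch. It is not the algorithm developed in this thesis: it assumes that the negative log posterior can be written as a sum of per-site data costs plus a fixed penalty `switch_cost` for every change of label between neighboring sites, and the names `map_chain`, `data_cost`, and `switch_cost` are hypothetical.

```python
def map_chain(g, labels, data_cost, switch_cost):
    """Minimize  sum_i data_cost(f_i, g_i) + switch_cost * #{i : f_i != f_{i+1}}
    over all label sequences f, by dynamic programming.
    """
    n = len(g)
    if n == 0:
        return []
    m = len(labels)
    # cost[k]: lowest energy of any labeling of sites 0..i that ends in labels[k].
    cost = [data_cost(q, g[0]) for q in labels]
    back = []  # back[i-1][k]: best label index at site i-1 when site i has labels[k]
    for i in range(1, n):
        new_cost, ptr = [], []
        for q in labels:
            def step(k, q=q):
                return cost[k] + (switch_cost if labels[k] != q else 0.0)
            best = min(range(m), key=step)
            new_cost.append(step(best) + data_cost(q, g[i]))
            ptr.append(best)
        cost = new_cost
        back.append(ptr)
    # Trace the optimal path backwards from the best final label.
    k = min(range(m), key=lambda j: cost[j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return [labels[k] for k in path]
```

The running time is proportional to the number of sites times the square of the number of labels, and the same routine handles binary signals (`labels = [0, 1]`) and piecewise constant signals with more levels.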

b) Estimation of Two-Dimensional, Binary MRF’s.

We  heuristically  motivate  and  develop  a  new  deterministic  algorithm  for 
approximating  the  optimal  Bayesian  estimator  of  two-dimensional  MRF’s.  We 
find,  experimentally,  that  the  quality  of  the  results  produced  by  this  scheme  is 
equivalent  to  those  obtained  by  the  general  Monte  Carlo  procedure,  and  the 
computational  efficiency  (execution  time)  is  improved  at  least  by  an  order  of 
magnitude. 

For  the  case  of  the  MAP  estimation  of  binary  patterns,  we  develop  a  modification 
to  the  "Simulated  Annealing"  procedure,  which  improves  its  computational 
efficiency.  It  is  based  on  the  computation  of  "coarse  solutions"  (formed  by 
aggregating  the  elements  of  the  field  into  blocks)  which  are  then  progressively 


c)  Reconstruction  of  Piecewise  Continuous  Surfaces. 

In  this  case,  we  also  develop  a  heuristic,  deterministic  scheme  whose  experimental 
performance  is  practically  equivalent  to  that  of  the  Monte  Carlo  procedure, 
and  improves  significantly  on  its  computational  efficiency. 

d)  Stereo  Matching. 

We  propose  a  new  algorithm  for  solving  the  stereo  matching  problem  in 
some  simple  cases.  This  scheme  is  based  on  the  direct  implementation  of  the 
local  constraints  (generated  by  the  probabilistic  model)  in  a  highly  distributed 
cooperative network of a particular form: a "Winner-Take-All" network. We
show  rigorously  that,  for  noise-free  observations,  the  state  of  this  network  will 
converge  to  the  correct  solution,  and  estimate  the  maximum  number  of  required 
iterations  (which  is  usually  very  small).  The  application  of  this  technique  to 
the  reconstruction  of  the  depth  of  real  objects  from  stereoscopic  photographs 
is  discussed,  and  some  modifications  to  the  algorithm  are  introduced,  which 
permit  us  to  produce  results  whose  quality  is  comparable  to  those  of  other 
"state of the art" algorithms.

4.  Parameter  Estimation. 

In  the  context  of  the  estimation  of  two-dimensional,  binary  fields,  we  study  the 
case  where  the  parameters  that  characterize  the  field  model  and  the  noise  are  not 
known,  and  have  to  be  estimated  from  the  noisy  observations,  a  situation  that,  so  far, 
has never been treated. We present a maximum likelihood procedure which, based
on an analysis of the residual ("innovations") process, permits the simultaneous
estimation  of  the  field  and  the  parameters  of  the  system.  We  apply  this  technique 
to  the  construction  of  an  algorithm,  which  does  not  have  any  free  parameters, 
for  the  reconstruction  of  piecewise  uniform  images,  and  perform  experiments  to 
demonstrate  its  performance. 

5.  Parallel  Implementations. 

An important issue regarding the practical value of the algorithms that we
develop is their possible implementation in certain non-conventional hardware,
such as massively parallel digital machines, hybrid and analog computers, etc. In
this connection, we make the following contributions:

a)  Monte  Carlo  Procedures. 

We  analyze  the  parallel  implementation  of  the  general  Monte  Carlo  procedure  for 
approximating  the  optimal  Bayesian  estimators.  We  show  that  the  convergence 
of  certain  widely  used  algorithms  (such  as  the  Metropolis  and  Heat  Bath 
schemes)  cannot  be  guaranteed  in  this  case.  We  justify  the  selection  of  an 
appropriate  algorithm  (the  "Gibbs  Sampler"),  and  present  an  estimate  of  its 
computational  complexity. 

b)  Reconstruction  of  Piecewise  Continuous  Surfaces. 

The parallel implementations of both the modified Monte Carlo procedure
and the deterministic algorithm that solve this problem are analyzed, and
their  computational  complexity  is  estimated.  We  also  propose  schemes  for  the 
construction  of  hybrid  (digital/analog)  and  analog  networks  that  implement 
these  procedures,  and  perform  digital  simulations  to  evaluate  experimentally 
their  performance. 

c)  Estimation  of  Two-Dimensional  Binary  Fields. 

The computational complexity of the parallel implementation of the fast
deterministic algorithm that performs this task is estimated and compared with
that  of  the  general  Monte  Carlo  scheme. 

We  also  propose  the  adaptation  of  a  class  of  analog  networks  proposed  by 
Hopfield  and  Tank  (1985),  so  that  we  can  obtain  an  approximation  to  the 
optimal  estimate  of  the  field  from  the  equilibrium  state  of  this  system.  The 
performance  of  this  scheme  is  assessed  experimentally  by  means  of  numerical 
simulations. 

3.2.  Thesis  Overview. 

This  thesis  is  organized  in  the  following  way: 


In  chapter  two  we  will  introduce  the  basic  concept  of  a  Markov  random  field; 
show  how  to  compute  the  corresponding  probability  distribution,  and  present  Monte 
Carlo  procedures  for  generating  sample  functions.  In  chapter  three,  we  develop 
loss  functionals  for  the  image  segmentation  and  surface  reconstruction  problems, 
and  derive  the  corresponding  optimal  Bayesian  estimators.  We  also  present  general 
algorithms  for  computing  these  estimates,  and  discuss  their  implementation  in 
parallel  hardware. 

These  results  are  applied,  in  chapter  four,  to  the  problem  of  segmenting 
piecewise  constant  images  given  noisy  observations.  For  the  particular  case  of 
binary  images,  a  very  efficient  distributed  algorithm  is  developed,  and  we  present 
a  procedure  for  the  case  when  the  model  and  the  noise  parameters  are  not  known, 
and  have  to  be  estimated  from  the  noisy  data.  Also  in  this  chapter,  we  show  how 
these principles can be applied to the problem of computing the perceptual clusters
that  are  formed  in  some  dot  patterns. 

In  chapter  five,  we  treat  the  problem  of  reconstructing  piecewise  smooth  surfaces 
from  sparse  and  noisy  data,  without  blurring  the  boundaries  between  continuous 
regions;  we  discuss  the  use  of  Markov  random  field  models  to  embody  the  prior 
knowledge  about  the  shape  and  location  of  the  discontinuities,  and  show  how 
to adapt the general reconstruction algorithms developed in chapter three to this
problem.  We  also  develop  a  special  purpose  efficient  algorithm  for  this  case,  and 
discuss  its  parallel  implementation. 

Chapter  six  is  devoted  to  the  problem  of  the  reconstruction  of  depth  from 
stereoscopic  images.  As  in  the  previous  cases,  we  first  present  a  probabilistic 
formulation  of  the  problem,  and  extend  the  general  methods  of  chapter  three  for 
implementing  a  solution.  Then,  we  develop  special  purpose  algorithms  that  improve 
the  computational  efficiency.  The  performance  of  these  algorithms  is  illustrated 
using  both  synthetic  and  "real"  images. 

Finally,  in  chapter  seven,  we  summarize  our  results,  and  suggest  areas  where 
future  research  may  be  fruitful. 
Chapter 2


LOCAL  SPATIAL  INTERACTION  MODELS 


1. Introduction.

The  key  to  the  success  in  the  use  of  the  probabilistic  (and  in  particular,  Bayesian) 
approach  for  the  solution  of  the  class  of  reconstruction  problems  in  which  we  are 
interested,  is  our  ability  to  find  a  class  of  stochastic  models  (that  is,  random  fields) 
that  have  the  following  characteristics: 

(i)  The  probabilistic  dependencies  between  the  elements  of  the  field  should 
be  spatially  localized.  This  condition  is  necessary  if  the  field  is  to  be  used 
to model surfaces that are only piecewise smooth; besides, if it is satisfied,
the  reconstruction  algorithms  will  be  distributed,  and  thus,  efficiently 
implementable  in  parallel  hardware. 

(ii)  The  class  should  be  rich  enough,  so  that  a  wide  variety  of  qualitatively 
different  behaviors  of  the  desired  solutions  can  be  modeled. 

(iii) The relation between the parameters of the models and the characteristics
of  the  corresponding  sample  fields  should  be  relatively  transparent,  so  that 
the  models  are  easy  to  specify. 

(iv) It should be possible to represent the prior probability distribution $P_f$
explicitly, so that Bayes theory can be applied.

(v)  It  should  be  possible  to  specify  an  efficient  Monte  Carlo  procedure  for 
generating  sample  fields  from  the  distribution,  so  that  the  ability  of  the 
model  to  represent  our  prior  knowledge  can  be  verified. 

Fortunately,  there  is  a  class  of  models  that  satisfies  these  characteristics:  the 
class  of  Markovian  Random  Fields  (MRF)  on  lattices.  We  will  describe  them  in 
this  chapter,  and  we  will  also  show  how  they  satisfy  the  required  conditions.  To 
do this, we will need two important results: the Hammersley-Clifford theorem,
which  is  related  to  conditions  (iii)  and  (iv),  and  the  Metropolis  and  Gibbs-sampler 
algorithms,  which  will  permit  us  to  satisfy  condition  (v). 


2.  Markov  Random  Fields. 


The concept of a MRF is a direct extension of the concept of a Markov process
to higher dimensions and originated in the work of Ising (1925) on the construction
of models for ferromagnetic phenomena. The definition for a two dimensional
continuous MRF was introduced by Wong (1968), following Levy (1956) (see also
Dobrushin, 1968), and in intuitive terms it says that a random field is Markovian
if, for any closed curve that separates the space into two regions, knowledge of
the value of the field along the curve makes the field in these two regions mutually
independent.

More  useful  for  our  purposes  (since  usually  we  will  be  interested  only  in 
reconstructing  the  field  at  the  sites  of  a  regular  lattice)  is  the  definition  of  a  discrete 
MRF,  a  generalization  of  the  concept  of  a  Markov  chain.  A  discrete  Markov 
random  field  on  a  finite  lattice  is  defined  as  a  collection  of  random  variables,  which 
correspond  to  the  sites  of  the  lattice,  whose  probability  distribution  is  such  that 
the  conditional  probability  of  a  given  variable  having  a  particular  value,  given  the 
values  of  the  rest  of  the  variables,  is  identical  to  the  conditional  probability  given 
the values of the field in a small set of sites, which we will call the neighborhood
of  the  given  site.  In  formal  terms  we  have  the  following  (see  Geman  and  Geman, 
1983,  and  also  Woods,  1972  for  an  alternative  definition): 

Let $S$ be a finite set of $N$ sites, and $G = \{G_s, s \in S\}$ be a neighborhood
system for $S$, i.e., a collection of subsets of $S$ for which:

(i) $s \notin G_s$ for all $s \in S$.

(ii) $s \in G_r$ if and only if $r \in G_s$, for all $r, s \in S$.

Let $F = \{F_s, s \in S\}$ be any family of random variables indexed by $s \in S$, and
suppose, for simplicity, that these variables take values on some finite sets $Q_s$
(the definition can be extended, with some technical modifications, to the case of
continuous state space). We will call any possible sample realization $f$ of $F$
a configuration of the field. Let $\Omega$ be the set of all possible configurations (i.e., the
sample space), and let $P$ be a probability measure on $\Omega$. $F$ is a MRF with respect
to $G$ if:

(i) $P(F = f) > 0$ for all $f \in \Omega$ (where $(F = f)$ denotes the event $(F_s = f_s$ for
all $s \in S)$).

(ii) $P(F_s = f_s \mid F_r = f_r, r \neq s) = P(F_s = f_s \mid F_r = f_r, r \in G_s)$
for every $s \in S$.

Figure 2. Sites 1, 2, 3 and 4 are the neighborhood of site j.

It is clear from this definition that, if the size of the neighborhoods is small,
a  MRF  will  satisfy  the  first  condition  we  required  from  our  class  of  models.  The 
direct  specification  of  a  MRF  from  this  definition  (i.e.,  in  terms  of  the  conditional 
probabilities),  however,  is  not  very  convenient  because  of  the  following  reasons: 

Firstly, the functions defining valid conditional distributions for a MRF cannot
be chosen arbitrarily, since they have to satisfy a set of consistency conditions (that
result from Bayes' rule; see Besag, 1972), and are, in general, very difficult to specify
directly. Secondly, although the joint probability distribution $P_f$ can be uniquely
determined from the conditional probabilities, its computation is, in general, a
highly non-trivial task. Finally, there is no obvious intuitive relation between the
form of the conditional probability distributions and the qualitative behavior of the
sample fields.

To  overcome  these  difficulties,  we  need  an  alternative  way  of  defining  a  MRF. 
This  is  done  as  follows. 


2.1.  Markov-Gibbs  Equivalence. 

First, we need the following definition:

Given a system of neighborhoods on a lattice, we define a "clique" $C$ as either
a single site, or a set of sites of the lattice, such that all the sites that belong to $C$ are
neighbours of each other. For example, on a 4-connected lattice (Fig. 2), the sites
1, 2, 3 and 4 form the neighborhood of site j, and the cliques are sets consisting
either of single sites, or of two (vertically or horizontally) adjacent sites (nearest
neighbours; see Fig. 3).
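The clique definition can be made concrete with a short sketch. The code below is only an illustration, not part of the original development: the function names `neighbors_4` and `cliques` and the representation of sites as `(row, col)` tuples are assumptions introduced here. It enumerates the cliques of a small 4-connected lattice; since such a lattice contains no three mutually neighboring sites, only single sites and nearest-neighbor pairs appear.

```python
from itertools import combinations


def neighbors_4(site, rows, cols):
    """Return the 4-connected neighborhood G_s of a site (row, col).

    The site itself is never included, and the relation is symmetric,
    as required of a neighborhood system.
    """
    r, c = site
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return {(i, j) for i, j in candidates if 0 <= i < rows and 0 <= j < cols}


def cliques(rows, cols):
    """Enumerate the cliques of a rows x cols 4-connected lattice."""
    sites = [(r, c) for r in range(rows) for c in range(cols)]
    found = [frozenset([s]) for s in sites]      # every single site is a clique
    for s, t in combinations(sites, 2):
        if t in neighbors_4(s, rows, cols):      # pairs of mutual neighbors
            found.append(frozenset([s, t]))
    return found
```

On a 3 x 3 lattice this yields 9 single-site cliques and 12 pair cliques (6 horizontal and 6 vertical).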

The result we are looking for is contained in the Hammersley-Clifford theorem
(Hammersley and Clifford, 1971), which states that if $F$ is a MRF on a lattice
$S$ with respect to the neighborhood system $G$, the probability distribution of the
configurations (sample functions) generated by it will always have a definite form,
which is that of a Gibbs distribution:

$$P_f(f) = \frac{1}{Z} e^{-\beta U(f)}$$

where $Z$ is a normalizing constant, $\beta$ is a parameter, and the "Energy function"
$U(f)$ is of the form:

$$U(f) = \sum_C V_C(f)$$


where $C$ ranges over the cliques associated with the given neighborhood system,
and the potentials $V_C(f)$ are functions supported on them. Thus, in our example of
a 4-connected lattice, $U$ would be of the form:

$$U(f) = \sum_i V_a(f_i) + \sum_{i,j \in N_h} V_b(f_i, f_j) + \sum_{i,j \in N_v} V_c(f_i, f_j)$$

where $N_h$ and $N_v$ denote the sets of all horizontal and vertical nearest neighbor
pairs of sites of the lattice (figure 3 (b) and (c)), respectively, and $V_a$, $V_b$ and $V_c$ are
some functions.
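As an illustration of this decomposition, the following sketch evaluates the energy of a configuration stored as a two-dimensional array. It is a minimal example under stated assumptions: the function name `energy` and the sample potentials are hypothetical, and NumPy is used only for array slicing.

```python
import numpy as np


def energy(f, V_a, V_b, V_c):
    """Energy of a configuration on a 4-connected lattice.

    V_a acts on single sites, V_b on horizontally adjacent pairs and
    V_c on vertically adjacent pairs.
    """
    f = np.asarray(f)
    u = sum(V_a(x) for x in f.ravel())
    # Horizontal pairs: each site with its right-hand neighbor.
    u += sum(V_b(a, b) for a, b in zip(f[:, :-1].ravel(), f[:, 1:].ravel()))
    # Vertical pairs: each site with the neighbor below it.
    u += sum(V_c(a, b) for a, b in zip(f[:-1, :].ravel(), f[1:, :].ravel()))
    return u


def zero(x):
    """Illustrative single-site potential (assumed, not from the text)."""
    return 0.0


def pair(a, b):
    """Illustrative pair potential: -1 for equal states, +1 otherwise."""
    return -1.0 if a == b else 1.0
```

For instance, `energy([[0, 0], [0, 0]], zero, pair, pair)` sums four equal pairs, while `energy([[0, 1], [0, 1]], zero, pair, pair)` has two unequal horizontal pairs and two equal vertical pairs, so the contributions cancel.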

A  simple  proof  of  this  important  result  can  be  found  in  Besag  (1972).  We 
present  here  a  brief  sketch: 

Without loss of generality, we may assume that $0$ (the configuration with $f_i = 0$
for all $i$) belongs to $\Omega$ (otherwise, we simply perform a translation of the origin).

Since $F$ is a MRF, we have that

$$P(0) > 0$$

so that the quantity

$$Q(f) = \ln \frac{P(f)}{P(0)}$$

is well defined.

The key step is to note that we can always write:

$$\frac{P(f)}{P(0)} = e^{Q(f)}$$

with

$$Q(f) = \sum_i f_i G_i(f_i) + \sum_i \sum_j f_i f_j G_{ij}(f_i, f_j) + \cdots + f_1 f_2 \cdots f_n G_{1,2,\ldots,n}(f_1, f_2, \ldots, f_n)$$

for some functions $G_i, G_{ij}, \ldots$.

Now, for any configuration $f$ and any selected site $i$, we define the configuration
$f^{(i)}$ as being equal to $f$ everywhere, except possibly at site $i$, where it is equal to 0:

$$f^{(i)} = \{f_1, \ldots, f_{i-1}, 0, f_{i+1}, \ldots, f_n\}$$

Using Bayes rule we find that:

$$\frac{P(f)}{P(f^{(i)})} = \frac{P(f_i \mid f_j, j \neq i)\, P(f_j, j \neq i)}{P(0 \mid f_j, j \neq i)\, P(f_j, j \neq i)} = \frac{P(f_i \mid f_j, j \neq i)}{P(0 \mid f_j, j \neq i)} = \exp[Q(f) - Q(f^{(i)})]$$

$$= \exp\Big[f_i G_i(f_i) + \sum_j f_i f_j G_{ij}(f_i, f_j) + \cdots\Big]$$

Note that because of the Markov property, the above quotient of conditional
probabilities can depend only on the value of $f$ at those sites which are neighbors
of site $i$.

Now, suppose $l$ is not a neighbor of $i$, and consider a particular configuration $f$
which is equal to 0 everywhere, except at sites $i$ and $l$. By the above considerations,
we have that:

$$Q(f) - Q(f^{(i)}) = f_i G_i(f_i) + f_i f_l G_{il}(f_i, f_l)$$

depends only on $f_i$, which means that $G_{il}(f_i, f_l) = 0$.

By a similar reasoning, one can show that $G_{ij \ldots m}(f_i, f_j, \ldots, f_m)$ can be different
from 0 only if the sites $i, j, \ldots, m$ are neighbors of each other, i.e., if they belong to
the same clique. The proof is completed by defining:

$$V_C(f_i, f_j, \ldots, f_m) = -\frac{1}{\beta} f_i f_j \cdots f_m G_{ij \ldots m}(f_i, f_j, \ldots, f_m)$$

where $\beta$ is the parameter of the Gibbs distribution. With this choice, $Q(f) = -\beta U(f)$,
so that $P(f) = P(0) e^{-\beta U(f)}$, i.e., $Z = 1 / P(0)$.

It is important to note that whereas the functions defining valid conditional
probabilities for a MRF cannot be chosen arbitrarily, the form of the potentials
$V_C$ is not restricted in any way, and can be used freely to specify the required
behaviour of the field $f$ (which is what one does in practice). The relation between
these potentials and the conditional probabilities is given by the following formula
(which follows from Bayes rule):

$$P(F_i = f_i \mid F_j = f_j, j \neq i) = \frac{\exp[-\beta \sum_{C: i \in C} V_C(f)]}{\sum_{q \in Q_i} \exp[-\beta \sum_{C: i \in C} V_C(f^q)]} \qquad (1)$$

where $Q_i$ is the set of allowable values for the state of $F_i$, and $f^q$ is the configuration
which is equal to $q$ at site $i$, and coincides with $f$ everywhere else.
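A minimal sketch of how the conditional distribution of a single site follows from the potentials is given below. It assumes, purely for illustration, a 4-connected lattice with pair potentials only and no single-site term; the names `local_energy`, `conditional` and `ising_pair` are hypothetical. Only the cliques that contain the site contribute, since all other terms cancel between numerator and denominator.

```python
import math


def ising_pair(a, b):
    """Illustrative pair potential: -1 for equal states, +1 otherwise."""
    return -1.0 if a == b else 1.0


def local_energy(f, i, j, q, pair_potential):
    """Sum of the pair potentials of all cliques that contain site (i, j),
    evaluated with state q at that site and f everywhere else."""
    rows, cols = len(f), len(f[0])
    total = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = i + di, j + dj
        if 0 <= r < rows and 0 <= c < cols:
            total += pair_potential(q, f[r][c])
    return total


def conditional(f, i, j, states, pair_potential, beta=1.0):
    """Conditional distribution of the state at site (i, j) given the rest."""
    weights = [math.exp(-beta * local_energy(f, i, j, q, pair_potential))
               for q in states]
    z = sum(weights)                      # normalization over allowable states
    return [w / z for w in weights]
```

For a site whose four neighbors are all in state 0, the two local energies are -4 and +4, so with `beta=1.0` the probability of state 0 is `1 / (1 + exp(-8))`.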

There are other ways of representing certain classes of MRF's. For example,
Woods (1972) has shown that every homogeneous Gaussian MRF defined on a
finite lattice satisfies a difference equation of the form:

$$f_{nm} = \sum_{(k,l) \in D(P)} h_{kl} f_{n-k,m-l} + u_{nm}$$

where $f_{nm}$ is the value of the field at site $nm$ and $u$ is a (non-white) stationary
Gaussian field whose autocorrelation function satisfies:

$$E[u_{nm} u_{00}] = \begin{cases} c, & m = n = 0 \\ -h_{nm} c, & (n,m) \in D(P) \\ 0, & \text{elsewhere} \end{cases}$$

where

$$D(P) = \{(k,l) : 0 < k^2 + l^2 \le P^2\}$$

and also

$$E[f_{nm} u_{kl}] = \begin{cases} c, & \text{if } n = k \text{ and } m = l \\ 0, & \text{otherwise} \end{cases}$$

The numbers $h_{kl}$ can be interpreted as the coefficients of the linear minimum mean
square error estimator of $f_{nm}$ given its neighbors out to distance $P$, and $u$ as the
estimation error.

This representation (called a "Conditional Markov" (CM) model by Kashyap
(1983)) can then be used to generate sample functions (Woods (1972) also presents
an algorithm, based on the discrete Fourier transform, for the generation of sample
realizations of the field $u$, and for the computation of the joint distribution for $f$).
A field that satisfies a difference equation of the form:

$$f_{nm} = \sum_{(k,l) \in D(P)} h_{kl} f_{n-k,m-l} + w_{nm}$$

where $\{w_{nm}\}$ are independent random variables, is called a "Simultaneous
Autoregressive" (SAR) model by Kashyap (a similar representation can be obtained
for fields with exponential autocorrelation functions; see Habibi, 1972). Although
it  is  claimed  that  for  any  homogeneous  SAR  model  it  is  possible  to  find  a  MRF 
with  the  same  spectral  density,  albeit  with  a  different  neighborhood  structure,  it 
is  in  general  very  difficult  to  compute  the  joint  distribution  explicitly  from  the 
SAR  representation.  On  the  other  hand,  the  Gibbs  representation  has  the  following 
advantages: 

(i) It is perfectly general: it applies to discrete valued fields, and it can be
easily  generalized  to  the  case  of  continuous  valued  ones. 

(ii)  It  is  easy  to  generate  sample  functions  from  the  distribution  (we  will  discuss 
algorithms  for  doing  this  in  the  next  section). 

(iii)  Since  the  posterior  distribution  is  also  a  Gibbs  measure,  the  optimal 
estimates  can  be  obtained  directly  from  the  posterior  energy  function. 

For these reasons, this is the representation that we will adopt.


3. Generation of Sample Configurations of MRFs.

3.1. The Metropolis and Gibbs-Sampler Algorithms.

The earliest successful Monte Carlo procedure for the generation of sample
functions of MRF's was developed by Metropolis et al. (1953) for the numerical
computation of thermodynamic properties of many-particle systems in thermal
equilibrium. To describe it, let us consider a system with $N$ particles, each of which
may be in any one of a finite number of allowable states. Let $f_j$ denote the state of
the $j$th particle (we will refer to the $N$-vector $f$ as the global configuration of the
system), and let $U(f)$ be the corresponding energy.

The basic idea of the algorithm is to construct a Markov chain whose states
correspond to the global configurations of the system at discrete time intervals
$1, \ldots, n$. It is a well known fact, from statistical physics, that when the physical
system is at thermal equilibrium at a given temperature $T$, its configurations will be
distributed according to the Gibbs measure:

$$\pi(f) = \frac{1}{Z} \exp\left[-\frac{U(f)}{T}\right] \qquad (2)$$

Therefore, we want $\pi(f)$ to be the invariant measure for our chain. If the chain is
regular (i.e., if it is possible to go between any two states in some fixed number of
steps), $\pi(f)$ will be the unique vector satisfying:

$$\pi P_c = \pi$$

where $P_c$ is the transition matrix of the chain (see Kindermann and Snell, 1980).

Also, since a system in equilibrium looks the same if we reverse the time
direction, we require that the associated chain be reversible, that is,

$$\Pr(f(n+1) = j \mid f(n) = i) = \Pr(f(n-1) = j \mid f(n) = i)$$

For a regular chain, reversibility is equivalent to the "detailed balance" condition:

$$\pi(f) P_c(f, f') = \pi(f') P_c(f', f) \qquad (3)$$

where $f$ and $f'$ are any two global configurations. This condition means that, if
we consider a large collection of isolated, identical systems, each one in thermal
equilibrium at the same temperature (the so called "Canonical Ensemble"), the
number of systems going from state $f$ to $f'$ must equal the number of systems going
from $f'$ to $f$. This condition is also sufficient for the convergence of the chain to
the desired Gibbs measure.

The  algorithm  proposed  by  Metropolis  generates  a  regular  chain  that  satisfies 
(3).  It  is  as  follows: 

Suppose that we visit the particles of the system (i.e., the sites of the lattice) in
some random sequential order (for example, we choose the next site to be visited at
random with uniform distribution). When a particle $j$ is visited, we update its state
as follows:

(i) Choose a new state $\bar{f}_j$ randomly from the set of allowable states using a
uniformly distributed random number.

(ii) Compute the increment in energy $\Delta E_j$ that results from moving the state
of the $j$th particle from $f_j$ to $\bar{f}_j$.

(iii) If $\Delta E_j \le 0$, make the move, i.e., set $f_j = \bar{f}_j$.

If $\Delta E_j > 0$, generate a new random number $r$, uniformly distributed
between 0 and 1.

If $r < e^{-\Delta E_j / T}$, set $f_j = \bar{f}_j$.

If $r \ge e^{-\Delta E_j / T}$, leave $f_j$ unchanged.
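The three steps above can be sketched as follows. This is an illustrative implementation only, assuming a 4-connected lattice stored as a list of lists and an energy made of pair potentials; the names `ising_pair`, `delta_energy` and `metropolis_step` are introduced here and do not come from the text.

```python
import math
import random


def ising_pair(a, b):
    """Illustrative pair potential: -1 for equal states, +1 otherwise."""
    return -1.0 if a == b else 1.0


def delta_energy(f, i, j, new, pair_potential):
    """Energy increment caused by changing the state at site (i, j) to `new`.

    Only the cliques containing (i, j) are affected, so the computation
    involves the four nearest neighbors only.
    """
    rows, cols = len(f), len(f[0])
    d = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = i + di, j + dj
        if 0 <= r < rows and 0 <= c < cols:
            d += pair_potential(new, f[r][c]) - pair_potential(f[i][j], f[r][c])
    return d


def metropolis_step(f, states, T, pair_potential, rng):
    """Visit one site chosen uniformly at random and apply steps (i)-(iii)."""
    rows, cols = len(f), len(f[0])
    i, j = rng.randrange(rows), rng.randrange(cols)
    new = rng.choice(states)                              # step (i)
    dE = delta_energy(f, i, j, new, pair_potential)       # step (ii)
    if dE <= 0 or rng.random() < math.exp(-dE / T):       # step (iii)
        f[i][j] = new
    return f
```

A typical use is to call `metropolis_step` repeatedly with `rng = random.Random(seed)`; at very low temperature a uniform configuration is left essentially unchanged, because every move that raises the energy is rejected with overwhelming probability.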

If we denote by $q(f, \bar{f})$ the probability of proposing the state $\bar{f}$ when the
system is at state $f$ (i.e., the probability of visiting particle $j$ and selecting the state
$\bar{f}_j$ for it; note that $q$ must be a symmetric, irreducible stochastic matrix, so that
$q(f, \bar{f}) = q(\bar{f}, f)$, by construction), we have that

$$P_c(f, \bar{f}) = q(f, \bar{f}) \min(1, e^{-\Delta U / T})$$

$$P_c(\bar{f}, f) = q(\bar{f}, f) \min(1, e^{\Delta U / T})$$

where

$$\Delta U = U(\bar{f}) - U(f)$$

Therefore, if $\Delta U \le 0$,

$$P_c(f, \bar{f}) = q(f, \bar{f}) \quad \text{and} \quad P_c(\bar{f}, f) = q(\bar{f}, f)\, e^{\Delta U / T}$$

and if $\Delta U > 0$,

$$P_c(f, \bar{f}) = q(f, \bar{f})\, e^{-\Delta U / T} \quad \text{and} \quad P_c(\bar{f}, f) = q(\bar{f}, f)$$

Clearly, in both cases, (3) is satisfied, since $\pi(\bar{f}) / \pi(f) = e^{-\Delta U / T}$ and $q$ is symmetric.
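The detailed balance property can also be checked numerically on a system small enough to enumerate. The sketch below is an illustration under assumptions made here, not a construction from the text: it uses a 2 x 2 binary lattice with a pair potential of -1 for equal neighbors and +1 otherwise, and a proposal that picks the site and the new state uniformly. It builds the full Metropolis transition matrix and the Gibbs measure, so that both the invariance of the measure and the symmetry of the probability flow can be verified directly.

```python
import itertools
import math

import numpy as np


def energy(cfg):
    """Energy of a 2x2 binary lattice stored as (f00, f01, f10, f11)."""
    pairs = [(0, 1), (2, 3), (0, 2), (1, 3)]   # horizontal, then vertical pairs
    return sum(-1.0 if cfg[a] == cfg[b] else 1.0 for a, b in pairs)


def metropolis_matrix(T):
    """Return the Metropolis transition matrix and the Gibbs measure at T."""
    configs = list(itertools.product((0, 1), repeat=4))
    index = {c: k for k, c in enumerate(configs)}
    P = np.zeros((len(configs), len(configs)))
    q = (1.0 / 4.0) * (1.0 / 2.0)              # uniform site, uniform new state
    for c in configs:
        for site in range(4):
            for new in (0, 1):
                prop = list(c)
                prop[site] = new
                prop = tuple(prop)
                dU = energy(prop) - energy(c)
                accept = min(1.0, math.exp(-dU / T))
                P[index[c], index[prop]] += q * accept
                P[index[c], index[c]] += q * (1.0 - accept)   # rejected move
    pi = np.array([math.exp(-energy(c) / T) for c in configs])
    return P, pi / pi.sum()
```

With `P, pi = metropolis_matrix(1.5)`, the matrix `pi[:, None] * P` is symmetric (detailed balance) and `pi @ P` reproduces `pi` (invariance).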

This is not the only chain that satisfies (3). Another possibility is to set:

$$P_c(f, \bar{f}) = q(f, \bar{f})\, \frac{\pi(\bar{f})}{\pi(f) + \pi(\bar{f})} = q(f, \bar{f})\, \frac{1}{1 + e^{\Delta U / T}}$$

in which case we get the "heat bath" algorithm (see Gidas (1984) and Hastings
(1982)).

A different construction, called the "Gibbs sampler", has been proposed by
Geman and Geman (1983) (see also Besag (1972)). In this scheme, too, at each
iteration only one site is modified; its new state, $\bar{f}_j$, is selected at random from the
conditional distribution given by equation (1). These authors show that, provided
only that we keep visiting every site (i.e., that we update its state "infinitely often"),
the resulting chain is ergodic, and its invariant measure is given by (2) (note that
reversibility is not required in this case). It is not difficult to see that for binary
systems this method is equivalent to the heat bath algorithm.
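The equivalence for binary systems can be illustrated with a few lines of code. The sketch below is hedged: the function names are hypothetical, the temperature T plays the role of the inverse of the parameter in equation (1), and the comparison assumes that the heat bath proposal always offers the opposite state of the visited site. Under these assumptions, the probability that the Gibbs sampler selects the other state coincides with the heat bath acceptance probability.

```python
import math
import random


def gibbs_conditional(local_energies, T):
    """Conditional distribution over the states of one site.

    local_energies[q] is the sum of the potentials of the cliques that
    contain the site when the site takes state q.
    """
    weights = [math.exp(-u / T) for u in local_energies]
    z = sum(weights)
    return [w / z for w in weights]


def gibbs_update(local_energies, T, rng):
    """Draw the new state of the site from its conditional distribution."""
    p = gibbs_conditional(local_energies, T)
    return rng.choices(range(len(p)), weights=p, k=1)[0]


def heat_bath_probability(delta_u, T):
    """Heat bath probability of accepting a move with energy increment delta_u."""
    return 1.0 / (1.0 + math.exp(delta_u / T))
```

For example, with local energies `[-4.0, 4.0]` the energy increment for switching from state 0 to state 1 is 8, and `gibbs_conditional([-4.0, 4.0], T)[1]` equals `heat_bath_probability(8.0, T)` up to rounding.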


3.2.  Statistical  Mechanics  Interpretation. 


To get an intuitive grasp on the way these algorithms work, it is useful to
recall some results from statistical mechanics (see, for example, Reif, 1965). When
a macroscopic system (i.e., a system with a large number of degrees of freedom)
is in thermal equilibrium at a given temperature $T$, its state $f$ will be such that
the Gibbs free energy $F$ is minimized. The relation between $F(f)$ and the internal
energy $U(f)$ of the system is given by:

$$F(f) = U(f) - TS$$

where the entropy $S$ is:

$$S = \ln \Omega(U)$$

and $\Omega(U)$ is the total number of feasible configurations of the system with energy
equal to $U$.

From this relation it is clear that at high temperatures, a system in equilibrium
will adopt a disordered, high energy configuration (which will have a high value of
$S$), while at low temperatures, the dominant tendency will be towards low energy
states. The probability distribution of the equilibrium energy is given by:

$$P_U(U) = \frac{1}{Z} e^{-U/T}\, \Omega(U)$$

where $Z$ is a constant. Since $\Omega(\cdot)$ is a rapidly increasing function of $U$, and the
negative exponential is rapidly decreasing, $P_U$ will be sharply peaked around a
value $U^*(T)$. Using the fact that $\Omega(U) = O(U^n)$, where $n$ is the number of degrees
of freedom of the system, one can show that the relative width $\Delta U / U^*$ of this peak will
be inversely proportional to the square root of $n$:

$$\frac{\Delta U}{U^*} \propto \frac{1}{\sqrt{n}}$$

(This result holds, in fact, not only for the energy, but for other related
thermodynamical properties as well). This means that, for large $n$, the Metropolis
(or Gibbs sampler) chain will generate (asymptotically) configurations whose energy
is very close to $U^*(T)$, which is an increasing function of $T$.

To illustrate this, let us consider a binary system on a four-connected square
lattice, whose energy function is given by:

$$U(f) = \sum_C V_C(f_i, f_j)$$

with

$$V_C(f_i, f_j) = \begin{cases} -1, & \text{if } f_i = f_j \\ 1, & \text{otherwise} \end{cases}$$

where $C$ ranges over all the nearest neighbor cliques of the lattice (this is the two
dimensional Ising model with "free boundaries", since the only interactions that
contribute to the energy are those between elements of the field that belong to the
lattice, which we will later discuss in detail).


In  figure  4  we  present  typical  equilibrium  configurations  generated  at  three 
different  temperatures  using  the  Metropolis  algorithm  with  random  updating  order. 
The  temperatures  used  correspond  to  0.8, 1.0  and  1.2  times  the  critical  temperature 
for  this  model  (the  critical  temperature  is  defined  as  the  maximum  value  of  the 
temperature  for  which  the  effect  of  fixed  conditions  at  the  boundary  of  a  square 
lattice  is  felt  at  the  center,  no  matter  how  large  the  lattice  is.  For  the  two-dimensional 
Ising  model  it  equals  1273). 

In  the  limit  of  very  large  lattices,  the  equilibrium  energy  per  spin  (which  is 
proportional  to  the  total  length  of  the  boundaries  between  "black"  and  "white" 
regions)  is  given  by  (see  Wannier,  1959): 

^-22  =  -Hcothiii  ±  ?(i  -aJ)'/»K-(c)] 


where we take the + or - sign above and below $T_c$, respectively; $k$ is the Boltzmann
constant; $a$ is given by:

$$a = \frac{2 \sinh(1/T)}{\cosh^2(1/T)}$$

and $K(\cdot)$ is the complete elliptic integral of the first kind (see, for example,
Hildebrand, 1976).

The  average  density  of  "black"  elements  can  be  computed  by  the  expression: 

CAT)  =  |(1  +  smh’d/n  -  l)]V* 

2  sinh  (1/r) 


The  shape  of  these  functions  is  illustrated  in  figure  5. 


From  a  qualitative  viewpoint,  one  can  see  that  the  temperature,  which  is  the 
only  free  parameter  of  this  model,  controls  the  granularity  (average  cluster  size  and 
cluster  density)  of  the  sample  patterns. 


Other examples of patterns generated with these algorithms (or some variations
of them) may be found in Cross and Jain (1983) and Hassner and Sklansky (1980),
where they are used as models for texture; in Geman and Geman (1983), as models
for piecewise constant images; and in Grenander (1983), where they are used to
produce more complex patterns.

3.3.  Continuous  Valued  State. 

Either of the two algorithms presented in section 3.1 can be generalized to the
case where the state of each particle can take any real value on a compact set
(e.g., a closed interval) at the expense of their computational efficiency. A different
approach that seems promising is based on the fact that a vector $f$ which obeys the
stochastic differential equation:

$$df = -\mathrm{grad}\, U(f)\, dt + \sqrt{2T}\, dw \qquad (4)$$

where $w$ is a vector Wiener process with unit variance (a collection of independent
Brownian motion processes), will be, under suitable smoothness conditions on
$U$, distributed asymptotically (as $t \uparrow \infty$) according to the Gibbs measure (2)
(see Grenander, 1984; Geman and Hwang, 1984). This means that we can use a
numerical simulation of (4) (see Wong and Zakai (1965)) to generate the desired
patterns. This approach has two interesting advantages, which result from the fact that,
in a numerical simulation, the increments $dw$ are approximated by independent,
identically distributed Gaussian random variables:

(i) We only need to generate Gaussian random numbers, for which efficient
algorithms exist.

(ii) All sites can be updated at the same time, so that efficient parallel
implementations can be adopted.

The probability distribution of the configurations generated by the system at
any given time can, in principle, be obtained by solving an appropriate system
of partial differential equations (i.e., the Kolmogorov equations; see, for example,
Karlin and Taylor, 1981); this will not be practical in most cases, however, so that
the rate of convergence of this algorithm will have to be assessed experimentally.

We  will  now  describe  how  an  extension  of  the  techniques  presented  in  this 
section  can  be  used  to  find  the  global  minimum  of  arbitrary  energy  functionals. 
As  we  will  show  in  the  next  chapter,  this  method  will  be  particularly  useful  for 
minimization  in  the  variational  principles  which  represent  the  Maximum  a  Posteriori 
estimated  solution  to  a  reconstruction  problem. 

4.  Simulated  Annealing  and  Global  Minimization. 

Simulated  annealing  is  a  new  technique,  developed  by  Kirkpatrick  et  al  (1983) 
for  the  solution  of  combinatorial  optimization  problems.  It  is  based  on  the  idea 
that  any  cost  functional  of  N  variables,  each  of  which  can  take  values  on  some 
finite  set,  can  be  considered  as  the  energy  function  of  a  physical  system  whose  state 
corresponds  to  a  particular  value  of  these  variables.  Therefore,  we  can  use,  say, 
the  Metropolis  algorithm  to  generate,  at  any  given  "temperature"  T  (which  now 
becomes  a  parameter  of  the  optimization  process)  samples  from  the  corresponding 
Gibbs measure. Since as $T \downarrow 0$ this measure converges to an impulse (or set of
impulses) corresponding to the state (or states) of minimum energy, the state of the
system in thermal equilibrium at zero temperature will correspond to the value of
$f$ that minimizes $U(f)$ globally.

One  serious  difficulty,  however,  is  that  attaining  thermal  equilibrium  might  take 
a  very  long  time  at  low  temperatures.  Kirkpatrick’s  idea  was  to  start  at  a  relatively 
high  temperature  (where  thermal  equilibrium  is  reached  very  fast),  and  then,  to 
slowly  cool  the  system,  until  "freezing"  occurs  and  the  state  stops  changing. 

4.1.  Discrete  Valued  State. 

Geman and Geman (1983) were able to show that if the temperature is lowered
at the rate:

$$T = \frac{C}{\ln(1 + n)} \qquad (5)$$

where $n$ is the number of iterations, and $C$ is a constant, this algorithm (using the
Gibbs sampler) will in fact converge (in probability) to the set of states of minimal
energy. They also showed that this chain is asymptotically ergodic in the sense that
for any real valued function $Y$ of the global state at time $t$, $f(t)$, we have:

$$\lim_{n \uparrow \infty} \frac{1}{n} \sum_{t=1}^{n} Y(f(t)) = \sum_{f \in \Omega} Y(f)\, \pi(f)$$

where $\Omega$ is the set of allowable global states. This means that we can use time
averages to estimate ensemble averages. Similar results have been obtained by Gidas
(1984) for the Metropolis and heat bath algorithms.
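The following sketch shows how a logarithmic cooling schedule of the form (5) can be combined with single-site updates. It is only an illustration and makes several assumptions that are not in the text: a one-dimensional binary chain is used instead of a lattice to keep the code short, Metropolis updates are used rather than the Gibbs sampler, one iteration $n$ is taken to be one sweep over the sites, and the names `schedule`, `energy` and `anneal` are hypothetical. The routine keeps track of the lowest-energy configuration encountered.

```python
import math
import random


def schedule(n, C):
    """Temperature at iteration n >= 1: T = C / ln(1 + n)."""
    return C / math.log(1.0 + n)


def energy(f):
    """Energy of a binary chain: -1 per equal neighboring pair, +1 otherwise."""
    return sum(-1.0 if a == b else 1.0 for a, b in zip(f, f[1:]))


def anneal(f, C, sweeps, rng):
    """Simulated annealing with Metropolis updates and logarithmic cooling."""
    f = list(f)
    best, best_u = list(f), energy(f)
    for n in range(1, sweeps + 1):
        T = schedule(n, C)
        for _ in range(len(f)):
            j = rng.randrange(len(f))
            proposal = list(f)
            proposal[j] = 1 - f[j]                 # flip the binary state
            dU = energy(proposal) - energy(f)
            if dU <= 0 or rng.random() < math.exp(-dU / T):
                f = proposal
        u = energy(f)
        if u < best_u:                             # remember the best state seen
            best, best_u = list(f), u
    return best, best_u
```

A call such as `anneal([0, 1] * 8, 3.0, 50, random.Random(1))` starts from the highest-energy (alternating) chain; the returned energy can never exceed the initial one, since the best configuration seen is retained.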

The minimal value of the constant $C$ in equation (5) for which convergence
can be guaranteed has not been determined in general. The value found by Geman
and Geman is:

$$C = N \Delta$$

where $N$ is the total number of sites in the lattice, and $\Delta$ is the largest absolute
difference in energies associated with pairs of global configurations that differ at
only one site. This value, however, is too large to be of any practical use in most
applications. Gidas (1984) has shown that if $U$ has no more than two local minima,
$C$ can be computed as:


where $\Delta'$ is the minimal energy change between a local minimizer and a neighboring
(in the sense that it differs at exactly one site) configuration. He also conjectures
that this expression holds in general, but this result has not been confirmed.

In  a  recent  paper,  White  (1984)  characterizes  the  initial  annealing  temperature 
in  terms  of  the  standard  deviation  of  the  "density  of  states"  (the  number  of  possible 
states  of  the  system,  per  unit  energy,  for  each  value  of  the  energy)  when  this  function 
is  approximately  Gaussian  (which  seems  to  be  the  case  for  a  large  class  of  systems). 
In  some  particular  cases  this  value  can  be  determined  analytically  from  the  structure 
of  the  problem,  but  in  general,  it  has  to  be  computed  numerically  from  a  simulation 
of  the  system  at  high  temperature. 

For the class of systems in which we are interested, we have found, by a trial
and error procedure, that a value of $C$ equal to 1.5 times the natural temperature of
the system (i.e., the temperature associated with the Gibbs distribution of the prior
MRF model) produces a reasonable convergence behaviour (of the order of 500
iterations), but clearly, more research, both theoretical and experimental, is needed
in this area.

Another important factor which determines the computational efficiency of
simulated annealing is related to the difficulty in computing the increment in energy
$\Delta U_j$ associated with a change in the state of the $j$th variable. If the energy function
comes from the probability measure of a MRF, the computation of $\Delta U_j$ will require
only the states of the variables in the neighborhood of $j$. Suppose now that we color
the  sites  of  the  lattice  in  such  a  way  that  any  two  neighbors  will  always  be  of  different 
color.  In  a  parallel  implementation  we  can,  in  principle,  update  the  states  of  all 
the  sites  that  are  of  the  same  color  in  a  simultaneous  way.  The  minimum  number 
of  colors  needed  to  satisfy  this  condition  is  called  the  "Chromatic  Number"  of  the 
graph  that  describes  the  neighborhood  structure  of  the  MRF,  and  it  is  bounded 
below  by  the  size  of  the  largest  clique  of  the  system.  This  number,  then,  determines 
the  minimum  number  of  steps  that  are  needed  in  a  parallel  machine  to  update  the 
state  of  the  whole  lattice.  We  will  analyze  these  implementations  in  more  detail  for 
some  particular  examples  in  the  next  chapters. 
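For the 4-connected lattice the coloring argument is easy to make explicit: a checkerboard pattern uses two colors, which matches the size of the largest clique (a pair of sites). The sketch below is illustrative only; the names `neighbors_4`, `color` and `color_classes` are assumptions made here. Sites of one color have no neighbors of the same color, so in principle they can all be updated in a single parallel step.

```python
def neighbors_4(site, rows, cols):
    """Return the 4-connected neighborhood of a site (row, col)."""
    r, c = site
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return {(i, j) for i, j in candidates if 0 <= i < rows and 0 <= j < cols}


def color(site):
    """Checkerboard coloring: the parity of row + column."""
    return (site[0] + site[1]) % 2


def color_classes(rows, cols):
    """Group the sites of the lattice by color.

    Each group can be updated simultaneously, since no two of its
    members are neighbors of each other.
    """
    classes = {0: [], 1: []}
    for r in range(rows):
        for c in range(cols):
            classes[color((r, c))].append((r, c))
    return classes
```

A full update of the lattice then takes two parallel steps, one per color class. Larger neighborhoods would need more colors, and a correspondingly larger number of steps.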

4.2.  Continuous  Valued  State. 

All  the  available  convergence  results  for  the  annealing  algorithm  hold  only  for 
the  case  where  the  set  of  allowable  values  for  the  state  of  each  variable  is  finite.  If 
this  set  is  infinite,  but  compact,  we  can  still  use  these  results  to  find  approximate 
solutions  by  discretizing  it.  However,  the  computational  complexity  will  increase 
as we increase the resolution of this discretization. An attractive alternative is to
generalize the approach discussed in section 3.3 by making $T$ in equation (4)
time dependent. A convergence proof for this modified scheme, for smooth energy
functions that satisfy appropriate boundary conditions, can be found in Geman and
Hwang (1984).


5.  Discussion. 


We have presented a class of probabilistic models with local dependencies
which can represent prior generic knowledge about the solution of a reconstruction
problem: the class of MRF's on finite lattices. We have seen how they can be
completely specified by defining arbitrary "potential functions" which are supported
on the cliques of the associated neighborhood system. It is thus easy to define
families of fields with a wide range of different behaviors. For example, if the only
prior knowledge that we have is that the reconstructed surface should be piecewise
constant, we may use a 4-connected lattice with Ising potentials:

$$V_C(f_i, f_j) = \begin{cases} -1, & \text{if } |i - j| = 1 \text{ and } f_i = f_j \\ 1, & \text{if } |i - j| = 1 \text{ and } f_i \neq f_j \\ 0, & \text{otherwise} \end{cases}$$

In  this  case,  the  natural  temperature  of  the  system  will  index  a  one  parameter 
family  of  fields  with  varying  degrees  of  granularity. 

Smooth surfaces can be modeled using the same neighborhood system, but
with quadratic potentials:

$$V_C(f_i, f_j) = \begin{cases} (f_i - f_j)^2, & \text{if } |i - j| = 1 \\ 0, & \text{otherwise} \end{cases}$$

More  complicated,  non-isotropic  patterns  can  also  be  modeled,  using  slightly 
larger  neighborhoods  (as  in  Cross  and  Jain,  1983).  Also,  as  we  will  see  in  chapter 
5,  an  appropriate  choice  of  the  lattice  and  the  neighborhood  system,  permits  one 
to  use  a  MRF  to  model  sets  of  piecewise  smooth  curves  on  the  plane.  Using  this 
construction,  it  is  possible  to  model  the  behavior  of  a  piecewise  smooth  function 
defined  on  a  two-dimensional  lattice  (a  "piecewise  smooth  surface")  by  coupling 
two  MRF’s:  one  for  the  smooth  portions,  and  another  for  the  curves  that  bound 
them. 

We  showed  how  the  probability  distribution  of  the  configurations  generated  by 
a  MRF  has  the  same  form  as  the  one  associated  with  a  macroscopic  physical  system 
in  thermal  equilibrium,  so  that  one  can  use  Monte  Carlo  procedures  that  simulate 




the  behavior  of  such  systems  to  generate  sample  functions  of  arbitrary  MRF's.  The 
Markovian property of the models implies that the computations performed by these
procedures  are  local  in  nature  (the  updating  rule  for  each  site  depends  only  on 
the  states  of  its  neighbors),  so  that,  in  principle,  efficient  parallel  schemes  can  be 
designed  for  their  implementation.  We  will  examine  this  question  in  detail  in  the 
next  chapter,  where  we  discuss  the  use  of  MRF  models  and  Bayes  theory  for  the 
optimal  solution  of  reconstruction  problems. 


Chapter  3 


OPTIMAL  BAYESIAN  ESTIMATORS 


1.  Introduction. 

The  use  of  the  Bayesian  approach  for  the  solution  of  reconstruction  problems 
requires  the  development  of  the  following  items: 

(i)  A  prior  probabilistic  model  for  the  functions  to  be  reconstructed. 

(ii)  Stochastic  models  for  the  observation  processes. 

(iii)  Appropriate  loss  (error)  criteria. 

(iv) Estimators that are optimal with respect to (i), (ii), and (iii).

(v)  Efficient  algorithms  for  the  computation  of  these  estimates. 

In  the  previous  chapter,  we  discussed  item  (i),  and  presented  a  class  of 
probabilistic  models  that  can  be  used  very  effectively  to  encode  prior  generic 
constraints about the solutions of reconstruction problems. In this chapter we will
develop the remaining ingredients that are necessary to perform optimal
reconstructions in the general case.

First  of  all,  let  us  formulate  the  class  of  problems  of  interest  in  a  precise  way, 
and  present  a  general  stochastic  model  for  the  observation  process. 

2.  Problem  Formulation. 

We mentioned in chapter 1 that there is an important class of perceptual
problems whose solution can be found by reconstructing a function $f : R^n \to R^m$
on a finite set of points that lie inside a compact domain $\Omega \subset R^n$. Although
the methods that we will develop are, in principle, perfectly general, for the sake of
clarity we will confine ourselves to the important particular case when n = 2 and
m = 1. We are, therefore, interested in reconstructing the value of a function f at
each one of the N sites of a lattice L (we will denote the value of the function at
site $i \in L$ by $f_i$).

2.1.  Stochastic  Model  for  the  Observations. 


Let us assume that we have a set of observations g on a subset S of the sites
of L, and that the process by which these observations are obtained can be modeled
by:

$$g_i = \Psi(H_i(f), n_i), \quad i \in S \tag{1}$$

Here, $H_i(\cdot)$ is an operator with local support that represents some kind of (in
general non-invertible) degrading operation (such as blurring); $\Psi$ is an operation
invertible with respect to $n_i$ (so that $n_i = \Psi^{-1}(g_i, H_i(f))$); it may represent,
for example, noise addition or multiplication followed by a memoryless non-linear
transformation. $n_i$ represents a scalar noise process with known probability
distribution $P_{n_i}$. We will assume that $n_i$ is independent of $n_j$ for all $i \neq j$, and
also that it is independent of f.


Given f, the conditional probability distribution for the observations $P_{g|f}$ will
be given by:

$$P_{g|f}(g; f) = \prod_{i \in S} P_{n_i}\left(\Psi^{-1}(g_i, H_i(f))\right)$$

Assuming that $P_{n_i}(n_i) > 0$ for all i, and all possible values of $n_i$, we can define
the functions $\Phi_i$ by:

$$\Phi_i(f, g_i) = -\ln P_{n_i}\left(\Psi^{-1}(g_i, H_i(f))\right) \tag{2}$$

so that we can write the conditional distribution as:

$$P_{g|f}(g; f) = \exp\left(-\sum_{i \in S} \Phi_i(f, g_i)\right) \tag{3}$$


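As an example, consider the case of additive, zero mean white Gaussian noise. We
have:

$$H_i(f) = f_i$$

$$\Psi(a, b) = a + b$$

$$P_{n_i}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-x^2 / 2\sigma^2\right]$$

$$P_{g|f}(g; f) = \prod_{i \in S} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-(f_i - g_i)^2 / 2\sigma^2\right] = \exp\left[-\sum_{i \in S} \left\{\ln(\sqrt{2\pi}\,\sigma) + \frac{(f_i - g_i)^2}{2\sigma^2}\right\}\right]$$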
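The Gaussian example can be checked numerically. The sketch below evaluates the functions $\Phi_i$ of equation (2) for additive Gaussian noise and confirms that the exponential form (3) agrees with the direct product of noise densities. The function names and the use of dictionaries keyed by site (so that g may be defined on a subset S only) are assumptions made for illustration.

```python
import math


def phi_gaussian(f_i, g_i, sigma):
    """Phi_i(f, g_i) = -ln P_n(g_i - f_i) for zero-mean Gaussian noise."""
    return (math.log(math.sqrt(2.0 * math.pi) * sigma)
            + (f_i - g_i) ** 2 / (2.0 * sigma ** 2))


def likelihood(f, g, sigma):
    """P_{g|f}(g; f) = exp(-sum_i Phi_i); only observed sites (keys of g) count."""
    return math.exp(-sum(phi_gaussian(f[i], g[i], sigma) for i in g))


def likelihood_direct(f, g, sigma):
    """Same quantity written as a plain product of Gaussian densities."""
    result = 1.0
    for i in g:
        x = g[i] - f[i]
        result *= math.exp(-x * x / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
    return result
```

With a hypothetical field `f = {0: 1.0, 1: 0.0, 2: 5.0}` observed only at sites 0 and 1, both functions return the same value up to rounding.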


2.2.  Posterior  Probability  Distribution. 


Since we are using a MRF model for the field f, its prior distribution will be
of the form:

$$P_f(f) = \frac{1}{Z} \exp\left[-\frac{1}{T_0} U_0(f)\right] \tag{4}$$

with

$$U_0(f) = \sum_C V_C(f)$$

where C ranges over the cliques of the neighborhood system of f.
Using Bayes rule, we find that the posterior distribution is:

$$P_{f|g}(f; g) = \frac{P_f(f)\, P_{g|f}(g; f)}{P_g(g)}$$

Using the expressions (4) and (3) for $P_f$ and $P_{g|f}$, and recognizing that $P_g(g)$ is
a constant for a given set of observations, we get that the posterior probability will
also follow a Gibbs distribution:

$$P_{f|g}(f; g) = \frac{1}{Z_P} \exp\left[-U_P(f; g)\right] \tag{5}$$

with

$$U_P(f; g) = \frac{1}{T_0} U_0(f) + \sum_{i \in S} \Phi_i(f, g_i) \tag{6}$$

where $Z_P$ is a constant, and the functions $\Phi_i$ are defined by (2).


We  can  now  provide  a  physical  interpretation  of  the  posterior  distribution,  by 
considering  that,  while  the  prior  distribution  (4)  describes  the  behavior  of  a  free  field 
in  thermal  equilibrium  (see  section  3.2  of  chapter  2),  the  distribution  (5)  describes 
the  behavior  of  the  same  field  coupled  with  a  fixed  (but  spatially  varying)  external 
field whose value is given by g. The functions $\Phi_i$, whose magnitude depends on the
noise variance, can then be interpreted as the coupling strengths between the two
fields. This coupled system is also Markovian, and if

$$H_i(f) = H_i(f_i) \quad \text{for all } i \in S$$

its  neighborhood  structure  will  be  identical  to  that  of  the  original  field. 

The  importance  of  this  interpretation  lies  in  the  fact,  which  will  be  proved 
in  the  following  sections,  that  the  optimal  estimate  for  /  can  be  obtained  simply 
by  observing  the  equilibrium  behavior  of  this  coupled  field.  Before  considering  this 
question in detail, let us define the appropriate cost functionals for the applications
we are interested in.

3.  Cost  Functionals. 

The  Bayesian  approach  to  the  solution  of  reconstruction  problems  has  been 
adopted  by  several  researchers.  In  most  cases,  the  criterion  for  selecting  the  optimal 
estimate  has  been  the  maximization  of  the  posterior  probability  (the  Maximum  a 
Posteriori  or  MAP  estimate).  It  has  been  used,  for  example,  by  Geman  and  Geman 
(1984)  for  the  restoration  of  piecewise  constant  images;  by  Grenander  (1984)  for 
pattern reconstruction, and by Elliot et al. (1983) and Hansen and Elliot (1982) for
the  segmentation  of  textured  images  (a  similar  criterion  —  the  maximization  of  a 
suitably  defined  likelihood  function  —  has  been  used  by  Cohen  and  Cooper  (1984) 
for  the  same  purposes). 

Since  the  use  of  this  criterion  defines  the  optimal  estimator  as  the  global 
minimizer of the posterior energy $U_P$ (equation 6), it is closely related to the
standard regularization method that we discussed in chapter 1. Indeed, if we assume
quadratic potentials for the prior MRF model, the term $U_0(f)$ corresponds to a
global smoothness assumption (the "stabilizing functional"), and if the observations
are corrupted by additive Gaussian noise, the term $\sum_i \Phi_i(f, g_i)$ will also be quadratic,
so that $U_P$ will have a unique minimum. For more general prior and observation
models,  the  MAP  estimator  may  be  considered  as  an  extension  of  the  standard 
regularization  approach.  Thus,  the  variational  principle  proposed  by  Blake  (1983), 
on  a  purely  pragmatic  basis,  for  the  reconstruction  of  piecewise  constant  images  is 
very  similar  to  the  one  derived  by  Geman  and  Geman  (1984).  Even  in  this  case, 
however,  the  precise  probabilistic  formulation  in  the  latter  case  is  preferable,  since 
it  provides  a  precise  interpretation  of  the  parameters,  and  a  practical  means  for 
verifying  the  adequacy  of  the  prior  assumptions  (via  the  experimental  analysis  of 
sample  fields). 

In  some  other  cases,  a  performance  criterion,  such  as  the  minimization  of  the 
mean  squared  error  has  been  implicitly  used  for  the  estimation  of  particular  classes 
of  fields.  For  example,  for  continuous-valued  fields  with  exponential  autocorrelation 
functions, corrupted by additive white Gaussian noise, Nahi and Assefi (1972) and
Habibi  (1972)  have  used  causal  linear  models  and  optimal  (Kalman)  linear  filters 
for  solving  the  reconstruction  problem. 

The  minimization  of  the  expected  value  of  error  functionals,  however,  has  not 
been  used  as  an  explicit  criterion  for  designing  optimal  estimators  in  the  general 
case.  We  will  show  that  this  design  criterion  is  in  fact  more  appropriate  in  our  case, 
for  the  following  reasons: 

(i)  It  permits  one  to  adapt  the  estimator  to  each  particular  problem. 

(ii)  It  is  in  closer  agreement  with  one’s  intuitive  assessment  of  the  performance 

of  an  estimator. 

(iii)  It  leads  to  attractive  computational  schemes. 

We  will  now  propose  design  criteria  for  two  particular  problems:  image 
segmentation  and  surface  reconstruction. 

3.1.  Error  Criterion  for  the  Segmentation  Problem. 

Consider a field f with N elements each of which can belong to one of a finite
set $Q_i$ of classes. Let $f_i$ denote the class to which the ith element belongs. The
segmentation problem is to estimate f from a set of observations $\{g_j, j \in S\}$. Note
that $f_i$ does not necessarily correspond to the image intensity. It may represent, for
example, the texture class for a region in the image (as in Elliot et al., 1983).

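A reasonable criterion for the performance of an estimate $\hat{f}$ is the number of
elements that are not classified correctly. Therefore, we define the segmentation
error $e_s$ as:

$$e_s(f, \hat{f}) = \sum_{i=1}^{N} \left(1 - \delta(f_i - \hat{f}_i)\right), \quad f_i, \hat{f}_i \in Q_i \tag{7}$$

where

$$\delta(a) = \begin{cases} 1, & \text{if } a = 0 \\ 0, & \text{otherwise} \end{cases}$$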

3.2.  Error  Criterion  for  the  Reconstruction  Problem. 

In this case, we also consider a field f with N elements which can take values
on finite sets $\{Q_i\}$, but now we assume specifically that $f_i$ represents the intensity
of an image (or the height of a surface) at site i. This suggests that an estimate $\hat{f}$
should be considered "good" if it is close to f in the ordinary sense, so that the
total squared error:

$$e_r(f, \hat{f}) = \sum_{i=1}^{N} (f_i - \hat{f}_i)^2$$


will  be  a  reasonable  measure  for  its  performance. 
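The two error criteria are simple to state in code. The following sketch assumes that a field and its estimate are given as equal-length sequences indexed by site; the function names are illustrative only.

```python
def segmentation_error(f, f_hat):
    """Number of sites whose class is estimated incorrectly (equation (7))."""
    return sum(1 for a, b in zip(f, f_hat) if a != b)


def squared_error(f, f_hat):
    """Total squared error between the field and its estimate."""
    return sum((a - b) ** 2 for a, b in zip(f, f_hat))


truth = [0, 1, 2, 2]
guess = [0, 2, 2, 0]
print(segmentation_error(truth, guess), squared_error(truth, guess))
```

The first criterion only counts mistakes, while the second also weighs how far each wrong value lies from the true one, which is the distinction that leads to two different optimal estimators.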

Let  us  now  derive  the  optimal  estimators  for  these  error  criteria. 

4.  Optimal  Bayesian  Estimators. 

To  derive  the  optimal  estimators  with  respect  to  the  criteria  stated  above,  we 
first  present  the  general  result  (which  can  be  found,  for  example  in  Abend,  1968) 
which  states  that  if  the  posterior  marginal  distributions  for  every  element  of  the  field 
are  known,  the  optimal  Bayesian  estimator  with  respect  to  any  additive,  positive 


definite  cost  functional  C  may  be  found  by  independently  minimizing  the  marginal 
expected  cost  for  each  element. 

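In more precise terms, we will consider cost functionals $C(f, \hat{f})$ of the form:

$$C(f, \hat{f}) = \sum_{i \in L} C_i(f_i, \hat{f}_i) \tag{10}$$

with

$$C_i(a, b) \begin{cases} = 0, & \text{if } a = b \\ > 0, & \text{if } a \neq b \end{cases} \quad \text{for all } i$$
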

We will assume that the value of each element $f_i$ of the field f is constrained
to belong to some finite set $Q_i$ (the generalization to the case of compact sets
is straightforward). The optimal Bayesian estimator $f^*$ with respect to the cost
functional C is defined as the global minimizer of the expected value of C over all
possible f and g:

$$\bar{C}(f^*) = \int C(f, f^*)\, dP_{f,g}(f, g) = \inf_{\hat{f}} \int C(f, \hat{f})\, dP_{f,g}(f, g) \tag{11}$$

We  now  have: 

Theorem  1: 


The optimal estimate of a field f with respect to the positive definite cost functional
C can be found by minimizing independently the marginal expected cost for each
element, i.e.,

$$f_i^* = q : \sum_{r \in Q_i} C_i(r, q) P_i(r \mid g) \le \sum_{r \in Q_i} C_i(r, s) P_i(r \mid g)$$

for all $s \neq q$, and for all $i \in L$.

$P_i(r \mid g)$ is the posterior marginal distribution of the element i:

$$P_i(r \mid g) = \sum_{f : f_i = r} P_{f|g}(f; g) \tag{12}$$


Proof: 


First, we note that since C is positive definite, and since

$$P_{f,g}(f, g) = P_{f|g}(f; g)\, P_g(g)$$

where $P_g(g)$ is a constant for a given set of observations, we can write, from (11):

$$\sum_f C(f, f^*) P_{f|g}(f; g) = \inf_{\hat{f}} \sum_f C(f, \hat{f}) P_{f|g}(f; g)$$

Using (10), we rewrite the right hand side as:

$$\inf_{\hat{f}} \sum_f \sum_i C_i(f_i, \hat{f}_i) P_{f|g}(f; g) = \inf_{\hat{f}} \sum_i \sum_{r \in Q_i} \sum_{f : f_i = r} C_i(r, \hat{f}_i) P_{f|g}(f; g)$$

From (12), we find that this expression is equal to:

$$\inf_{\hat{f}} \sum_i \sum_{r \in Q_i} C_i(r, \hat{f}_i) P_i(r \mid g)$$

which, since C is positive definite, we can rewrite as:

$$\sum_i \inf_{\hat{f}_i} \sum_{r \in Q_i} C_i(r, \hat{f}_i) P_i(r \mid g) \qquad \blacksquare$$
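Theorem 1 can be verified by brute force on a very small field. The sketch below builds an arbitrary posterior distribution over a hypothetical three-site field with three allowable values per site, then compares the global minimizer of the expected cost with the estimate obtained by minimizing each marginal expected cost separately. The field size, the random posterior and the choice of a squared-error site cost are assumptions made only for this check.

```python
import itertools
import random

Q = (0, 1, 2)          # allowable values at every site (assumed equal here)
N_SITES = 3
configs = list(itertools.product(Q, repeat=N_SITES))

# An arbitrary strictly positive posterior P_{f|g} over all configurations.
rng = random.Random(0)
weights = [rng.random() + 0.01 for _ in configs]
total = sum(weights)
posterior = {f: w / total for f, w in zip(configs, weights)}


def site_cost(a, b):
    """A positive definite per-site cost: zero only when a == b."""
    return (a - b) ** 2


def expected_cost(f_hat):
    """Expected value of the additive cost (10) under the posterior."""
    return sum(p * sum(site_cost(f[i], f_hat[i]) for i in range(N_SITES))
               for f, p in posterior.items())


def marginal(i, r):
    """Posterior marginal P_i(r | g) of equation (12)."""
    return sum(p for f, p in posterior.items() if f[i] == r)


# Global search over all estimates versus independent per-site minimization.
global_best = min(configs, key=expected_cost)
sitewise = tuple(
    min(Q, key=lambda q, i=i: sum(site_cost(r, q) * marginal(i, r) for r in Q))
    for i in range(N_SITES)
)
print(expected_cost(global_best), expected_cost(sitewise))
```

Both searches reach the same expected cost, but the global search visits every configuration while the per-site search needs only the marginals, which is the practical content of the theorem.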


The optimal estimators for the error criteria defined in section 3 can be easily
derived from this result:

In the case of the segmentation problem, we put

$$C_i(f_i, \hat{f}_i) = 1 - \delta(f_i - \hat{f}_i)$$

and get that

$$\sum_{r \in Q_i} \left(1 - \delta(r - \hat{f}_i)\right) P_i(r \mid g) = 1 - P_i(\hat{f}_i \mid g)$$

and therefore,

$$f_i^* = q \in Q_i : P_i(q \mid g) \ge P_i(s \mid g) \quad \text{for all } s \neq q \tag{13}$$

We will call this estimate the "Maximizer of the Posterior Marginals" ($\hat{f}_{MPM}$).
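For the reconstruction problem, we set:

$$C_i(f_i, \hat{f}_i) = (f_i - \hat{f}_i)^2$$

so that the condition of Theorem 1,

$$\sum_{r \in Q_i} (r - q)^2 P_i(r \mid g) \le \sum_{r \in Q_i} (r - s)^2 P_i(r \mid g)$$

implies that

$$-2q\bar{f}_i + q^2 \le -2s\bar{f}_i + s^2$$

or equivalently,

$$(\bar{f}_i - q)^2 \le (\bar{f}_i - s)^2$$

where

$$\bar{f}_i = \sum_{r \in Q_i} r P_i(r \mid g)$$

is the posterior mean of $f_i$, so that the optimal estimate is:

$$f_i^* = q \in Q_i : (\bar{f}_i - q)^2 \le (\bar{f}_i - s)^2 \quad \text{for all } s \neq q \tag{14}$$

We will call this estimate the "Thresholded Posterior Mean" ($\hat{f}_{TPM}$).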

Note that these results still hold if the sets $Q_i$ of allowable values for each
element, or the individual cost criteria $C_i$, are not the same for all i. In particular,
we may assume that the index i varies over the union of two lattices:

$$i \in L_1 \cup L_2$$

and let the field at the sites of $L_1$ represent the height of a piecewise smooth surface,
and at the sites of $L_2$, take an integer value to indicate the presence (and possibly
the direction) of a boundary between two adjacent continuous patches (see Geman
and Geman, 1984; we will explain this construction in detail in chapter 5). If we
now define a mixed error functional:

$$e_m(f, \hat{f}) = \sum_{i \in L_1} (f_i - \hat{f}_i)^2 + \lambda \sum_{i \in L_2} \left(1 - \delta(f_i - \hat{f}_i)\right)$$

for any positive value of $\lambda$, the optimal estimate will be:

$$f_i^* = \begin{cases} \hat{f}_{TPM}(i), & i \in L_1 \\ \hat{f}_{MPM}(i), & i \in L_2 \end{cases}$$
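Once the posterior marginals are available, the two estimators derived above reduce to a few lines each. The sketch below assumes the marginals are supplied as one dictionary per site, mapping each allowable value to its posterior probability; this representation and the function names are illustrative assumptions.

```python
def mpm_estimate(marginals):
    """Maximizer of the Posterior Marginals: most probable value at each site."""
    return [max(p, key=p.get) for p in marginals]


def tpm_estimate(marginals):
    """Thresholded Posterior Mean: allowable value closest to the posterior mean."""
    estimates = []
    for p in marginals:
        mean = sum(r * prob for r, prob in p.items())
        estimates.append(min(p, key=lambda q: (mean - q) ** 2))
    return estimates


marginals = [
    {0: 0.5, 1: 0.1, 2: 0.4},   # bimodal site: mean 0.9, mode 0
    {0: 0.2, 1: 0.7, 2: 0.1},   # unimodal site: mean 0.9, mode 1
]
print(mpm_estimate(marginals), tpm_estimate(marginals))
```

At the first site the two estimators disagree: the most probable class is 0, but the value closest to the posterior mean is 1. This is the behavior one would expect, since the squared-error criterion penalizes the distance to the values 0 and 2 while the segmentation criterion does not.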

The  main  obstacle  for  the  practical  application  of  these  results,  lies  in  the 
formidable  computational  cost  associated  with  the  exact  computation  of  the  marginals 
and  the  mean  of  the  posterior  distribution  given  by  (5),  even  for  lattices  of  moderate 
size.  In  the  next  section  we  will  present  a  general  distributed  procedure  that  will 
permit us to approximate these quantities as precisely as we may want.

5.  Algorithms. 

The algorithms that we will propose are based on the use of the Metropolis or
Gibbs Sampler schemes that we presented in chapter 2, to simulate the equilibrium
behavior of the coupled MRF described by equation (5). We recall that the Markov
chain generated by these algorithms is regular, and its invariant measure is the
posterior distribution $P_{f|g}$. The law of large numbers for regular chains (see, for
example, Kemeny and Snell, 1960) establishes that the fraction of time that the
chain will spend on a given state f will tend to $P_{f|g}(f; g)$ as the number of steps
gets large, independently of the initial state. This means that we can approximate
the posterior mean $\bar{f}$ by:

$$\bar{f}_i \approx \frac{1}{n - k} \sum_{t=k}^{n} f_i^{(t)} \tag{15}$$

and the posterior marginals by:

$$P_i(q \mid g) \approx \frac{1}{n - k} \sum_{t=k}^{n} \delta(f_i^{(t)} - q) \tag{16}$$

where $f^{(t)}$ is the configuration generated by the Metropolis algorithm at time t,
and k is the time required for the system to be in thermal equilibrium. From these
values, $\hat{f}_{MPM}$ and $\hat{f}_{TPM}$ can be easily computed using (13) and (14).
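The approximation (16) can be tried on a field small enough for the exact marginals to be enumerated. The sketch below is a minimal illustration, not the implementation used for the experiments in the text: it assumes a 2 x 2 binary field with Ising potentials, additive Gaussian noise, a single-site Gibbs sampler run at fixed temperature, and hypothetical values for the observations `G`, the noise level `SIGMA` and the natural temperature `T0`.

```python
import itertools
import math
import random

ROWS, COLS = 2, 2
SIGMA, T0 = 1.0, 2.0                 # hypothetical noise level and temperature
G = [[0.9, 0.2], [1.1, 0.7]]         # hypothetical noisy observations


def neighbors(r, c):
    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= rr < ROWS and 0 <= cc < COLS:
            yield rr, cc


def local_energy(f, r, c, q):
    """Terms of the posterior energy that involve site (r, c) when f_i = q."""
    prior = sum(-1 if q == f[rr][cc] else 1 for rr, cc in neighbors(r, c))
    return prior / T0 + (q - G[r][c]) ** 2 / (2 * SIGMA ** 2)


def estimate_marginals(n, k, seed=0):
    """Approximate P_i(1 | g) by the fraction of sweeps k..n-1 in which f_i = 1."""
    rng = random.Random(seed)
    f = [[rng.randint(0, 1) for _ in range(COLS)] for _ in range(ROWS)]
    counts = [[0] * COLS for _ in range(ROWS)]
    for t in range(n):
        for r in range(ROWS):
            for c in range(COLS):
                w0 = math.exp(-local_energy(f, r, c, 0))
                w1 = math.exp(-local_energy(f, r, c, 1))
                f[r][c] = 1 if rng.random() < w1 / (w0 + w1) else 0
        if t >= k:                   # discard the first k sweeps (equilibration)
            for r in range(ROWS):
                for c in range(COLS):
                    counts[r][c] += f[r][c]
    return [[counts[r][c] / (n - k) for c in range(COLS)] for r in range(ROWS)]


def energy(f):
    """Full posterior energy, dropping terms that do not depend on f."""
    u = 0.0
    for r in range(ROWS):
        for c in range(COLS):
            if c + 1 < COLS:
                u += (-1 if f[r][c] == f[r][c + 1] else 1) / T0
            if r + 1 < ROWS:
                u += (-1 if f[r][c] == f[r + 1][c] else 1) / T0
            u += (f[r][c] - G[r][c]) ** 2 / (2 * SIGMA ** 2)
    return u


def exact_marginals():
    """Exact P_i(1 | g), obtained by enumerating all configurations."""
    total, acc = 0.0, [[0.0] * COLS for _ in range(ROWS)]
    for vals in itertools.product((0, 1), repeat=ROWS * COLS):
        f = [list(vals[r * COLS:(r + 1) * COLS]) for r in range(ROWS)]
        w = math.exp(-energy(f))
        total += w
        for r in range(ROWS):
            for c in range(COLS):
                acc[r][c] += w * f[r][c]
    return [[a / total for a in row] for row in acc]
```

For a binary field with values 0 and 1, the marginal $P_i(1 \mid g)$ coincides with the posterior mean of $f_i$, so the same counts serve for both (15) and (16). The enumeration in `exact_marginals` grows as $2^N$ and is included only as a reference for this toy case.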

This procedure is related to the use of simulated annealing (see section 4 of
chapter 2) for finding the global minimum of $U_P$ (i.e., the MAP estimate: see
Geman and Geman, 1984). In our case, however, we are interested in gathering
statistics about the equilibrium behavior of the coupled field at a fixed temperature
T = 1, rather than in finding the ground state of the system. This fact gives our
procedure  some  distinct  advantages: 

1.  It  is  difficult  to  determine  in  general  the  descent  rate  of  the  temperature 
(annealing  schedule)  that  will  guarantee  the  convergence  of  the  annealing 
process  in  a  reasonable  time  (it  usually  involves  a  trial  and  error  procedure). 
Since  we  are  running  the  Metropolis  algorithm  at  a  fixed  temperature,  this 
issue becomes irrelevant.

2.  Since  in  our  case  we  are  using  a  Monte  Carlo  procedure  to  approximate 
the  values  of  some  integrals,  we  should  expect  a  nice  convergence  behavior,  in 
the  sense  that  coarse  approximations  can  be  computed  very  rapidly,  and  then 
refined  to  an  arbitrary  precision  (in  fact,  it  can  be  proved  (see  Feller,  1950) 
that  the  expected  value  of  the  squared  error  of  the  estimates  (15)  and  (16)  is 
inversely  proportional  to  n). 


The  main  disadvantage  of  this  procedure  is  that  in  the  case  of  the  segmentation 
problem,  a  large  amount  of  memory  might  be  required  if  the  number  of  classes 
per  element  m  is  large  (we  need  to  store  the  N(m  -  1)  numbers  that  define  the 
posterior  marginals). 


With respect to the relative performance, we point out that in many cases,
particularly for high signal to noise ratios, the MAP estimate is close to the
optimal one. If the noise level is high, however, the difference in the performances
of the two estimators may be dramatic. This is illustrated in the example portrayed
in figure 6: panel (c) represents the MAP estimate of the binary MRF (a) from the
noisy observations (b); it is clear that the approximations to the MPM estimates
shown in panels (d) and (e) are better than the MAP from almost any viewpoint. An
intuitive explanation for this behavior comes from the fact that the MAP estimator
is implicitly minimizing the expected value of a cost functional $C_{MAP}(f, \hat{f})$ which is
equal to zero only if $f_i = \hat{f}_i$ for all i, and is equal to, say, M otherwise. If the signal
to noise ratio is sufficiently high, the expected value of the optimal segmentation
error will be very close to zero, so that $\hat{f}_{MPM}$ and $\hat{f}_{MAP}$ will coincide. In a high
noise situation, however, the MAP estimator will tend to be too conservative, since
from its viewpoint it is equally costly to make one or one thousand mistakes. The
MPM estimator, in contrast, can make a better (although more risky) guess, since
making a few mistakes has only a marginal effect on the expected cost. We will
return to this example, and analyze in detail the relative performance of both
estimates in the next chapter.

6.  Computational  Complexity  and  Parallel  Implementations. 

We have seen how the optimal solutions of reconstruction problems, for a
large  class  of  cost  criteria,  can  be  obtained  from  the  observation  of  the  evolution 
of  the  Markov  chain  generated  by  the  algorithms  presented  in  chapter  2.  In  this 
section,  we  will  discuss  the  following  questions: 

(i) Which of these algorithms is the best one to use on a serial machine, from
the viewpoint of computational efficiency?

(ii) Which one is best suited for an implementation in parallel hardware?

We  will  also  describe  a  parallel  machine  that  is  currently  under  construction  at 
Thinking  Machines  Corporation  and  at  the  MIT  Artificial  Intelligence  Laboratory: 
the  "Connection  Machine"  (Hillis,  1985),  and  present  estimates  for  the  execution 
time  of  these  algorithms  in  that  particular  piece  of  hardware. 

6.1.  Serial  Complexity. 

Suppose we are running our algorithms on a serial machine. In the three cases
(Metropolis, Heat Bath and Gibbs Sampler), we first have to select the next site
whose state has to be updated. Assume it is site i. Let $\Delta U_i^q$ denote the increment
in the posterior energy associated with replacing the value of the state of the ith
element by the value q. Using (6) and the expression for $U_0$ of (4), we get:

$$\Delta U_i^q = \frac{1}{T_0} \sum_{C : i \in C} \left(V_C(f^{(q)}) - V_C(f)\right) + \Phi_i(f^{(q)}, g_i) - \Phi_i(f, g_i) \tag{15}$$

where

$$f_j^{(q)} = \begin{cases} q, & \text{if } j = i \\ f_j, & \text{otherwise} \end{cases} \tag{16}$$

Let $C(\Delta U)$ denote the computational cost of evaluating (15).
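The point of the energy increment above is that it only involves the cliques containing the updated site. The sketch below illustrates this for an assumed special case (Ising pair potentials on a 4-connected lattice, additive Gaussian noise, and `None` marking unobserved sites) and compares the local increment with the difference of two full energy evaluations. All function names and parameters are hypothetical.

```python
def pair_potential(a, b):
    """Ising potential for a pair of neighboring sites."""
    return -1 if a == b else 1


def posterior_energy(f, g, sigma, t0):
    """Full posterior energy (terms independent of f are dropped)."""
    rows, cols = len(f), len(f[0])
    u0, data = 0.0, 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                u0 += pair_potential(f[r][c], f[r][c + 1])
            if r + 1 < rows:
                u0 += pair_potential(f[r][c], f[r + 1][c])
            if g[r][c] is not None:          # only observed sites contribute
                data += (f[r][c] - g[r][c]) ** 2 / (2.0 * sigma ** 2)
    return u0 / t0 + data


def delta_u(f, g, r, c, q, sigma, t0):
    """Energy increment of replacing f[r][c] by q, using only local terms."""
    rows, cols = len(f), len(f[0])
    old = f[r][c]
    prior = 0.0
    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= rr < rows and 0 <= cc < cols:
            prior += pair_potential(q, f[rr][cc]) - pair_potential(old, f[rr][cc])
    data = 0.0
    if g[r][c] is not None:
        data = ((q - g[r][c]) ** 2 - (old - g[r][c]) ** 2) / (2.0 * sigma ** 2)
    return prior / t0 + data
```

The local version touches at most four neighbors and one observation regardless of the lattice size, which is what makes the cost $C(\Delta U)$ independent of N.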


The necessary steps for updating the state of site i are, in the Metropolis scheme
(see section 3.1 of chapter 2):

(i) Select the candidate state q from the set $Q_i$ (generate a uniform pseudo-random
number in the range $(0, |Q_i|]$, with cost C(prn), and load q from
a table, with cost C(load)).

(ii) Compute $\Delta U_i^q$.

(iii) Check if $\Delta U_i^q > 0$ (cost: C(comp)). If not, set $f_i = q$. Otherwise, go to (iv).

(iv) Compute $\exp[-\Delta U_i^q]$ (cost: C(exp)).

(v) Generate a new uniform pseudo-random number in the range (0, 1).

(vi) Compare it with $\exp[-\Delta U_i^q]$ (and set $f_i = q$ if the random number is smaller).

Therefore, we have that the total updating cost for the Metropolis scheme, $C_M$,
satisfies:

$$C_M \ge C(\Delta U) + C(prn) + C(comp) + C(load)$$

$$C_M \le C(\Delta U) + 2C(prn) + C(exp) + 2C(comp) + C(load) \tag{17}$$
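The Metropolis steps listed above can be sketched as a single-site update function. The interface is an assumption made for illustration: `delta_u` is a caller-supplied function that returns the energy increment for a candidate value, and `rng` is a `random.Random` instance.

```python
import math
import random


def metropolis_update(current, states, delta_u, rng):
    """One Metropolis step for a single site.

    current : present state of the site
    states  : tuple of allowable values for the site
    delta_u : function q -> energy increment of replacing `current` by q
    """
    q = rng.choice(states)                  # step (i): uniform candidate
    du = delta_u(q)                         # step (ii)
    if du <= 0:                             # step (iii): never reject downhill moves
        return q
    if rng.random() < math.exp(-du):        # steps (iv) to (vi)
        return q
    return current


# Usage on a hypothetical two-state system with energies E(0) = 0 and E(1) = 1.
E = {0: 0.0, 1: 1.0}
rng = random.Random(1)
state, visits = 0, 0
for _ in range(50000):
    state = metropolis_update(state, (0, 1), lambda q, s=state: E[q] - E[s], rng)
    visits += state
print(visits / 50000)   # fraction of time spent in state 1
```

In the long run the fraction of time spent in state 1 should approach $e^{-1} / (1 + e^{-1}) \approx 0.27$, the Gibbs probability of that state. Note that the exponential and the second random number are only needed for uphill moves, which is the source of the two bounds in (17).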

For the Heat Bath scheme, steps (i), (ii) and (iv) are identical, and step (iii) is
deleted. The remaining steps are in this case:

(v) Generate a new uniform pseudo-random number r in the range $(0, 1 + \exp[-\Delta U_i^q]]$.

(vi) If r > 1, set $f_i = q$; otherwise, leave $f_i$ unchanged.

The updating cost for the Heat Bath scheme, $C_{HB}$, is then:

$$C_{HB} = C(\Delta U) + 2C(prn) + C(exp) + C(comp) + C(add) + C(load) \tag{18}$$

and in general, it will be higher than $C_M$, since

$$C(exp) \gg C(comp)$$
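For comparison, the Heat Bath variant can be sketched with the same assumed interface (a caller-supplied `delta_u` function and a `random.Random` instance). Drawing r uniformly from $(0, 1 + \exp[-\Delta U]]$ and accepting when r > 1 gives an acceptance probability of $\exp[-\Delta U] / (1 + \exp[-\Delta U])$.

```python
import math
import random


def heat_bath_update(current, states, delta_u, rng):
    """One Heat Bath step for a single site (steps (i), (ii), (iv), (v), (vi))."""
    q = rng.choice(states)                       # uniform candidate
    boltzmann = math.exp(-delta_u(q))            # always evaluated, unlike Metropolis
    r = rng.uniform(0.0, 1.0 + boltzmann)
    return q if r > 1.0 else current


# Usage on the same hypothetical two-state system, E(0) = 0 and E(1) = 1.
E = {0: 0.0, 1: 1.0}
rng = random.Random(2)
state, visits = 0, 0
for _ in range(50000):
    state = heat_bath_update(state, (0, 1), lambda q, s=state: E[q] - E[s], rng)
    visits += state
print(visits / 50000)   # fraction of time spent in state 1
```

The exponential is computed at every step, which is why the cost (18) is in general higher than the Metropolis cost. This sketch does not guard against overflow of `math.exp` for very large negative increments.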

For the Gibbs Sampler, we select the new state by generating a pseudo-random
number which takes values on $Q_i$, with probabilities given by the conditional
distribution (equation (1) of chapter 2). To do this efficiently, we rewrite this
equation as:

$$P_i(q \mid f) = \frac{\exp[-\Delta U_i^q]}{\sum_{r \in Q_i} \exp[-\Delta U_i^r]}$$

(Note that $\Delta U_i^{f_i} = 0$).

Let $Q_i = \{q_1, \ldots, q_M\}$. We now generate an array a, by putting:

$$a_0 = 0$$

$$a_j = a_{j-1} + \exp[-\Delta U_i^{q_j}], \quad j = 1, \ldots, M$$

The new state $f_i$ is now computed by generating a uniform pseudo-random number
r in the range $(0, a_M]$, and putting:

$$f_i = q_j : r \in (a_{j-1}, a_j]$$

The computational cost will be:

$$C_{GS} = (M - 1)\left[C(\Delta U) + C(exp) + 2C(add) + 4C(load) + C(comp)\right] + C(prn) \tag{19}$$

Note that we are including the overhead cost incurred by the use of the auxiliary
array a.
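The cumulative-array construction translates almost directly into code. In the sketch below, the interface (a caller-supplied `delta_u` function and a `random.Random` instance) is again an assumption, and a binary search is used to locate the interval containing r; a linear scan over the array would serve equally well for small M.

```python
import bisect
import math
import random


def gibbs_sample(states, delta_u, rng):
    """Draw a new state with probability proportional to exp(-delta_u(q))."""
    cumulative = [0.0]                                   # a_0 = 0
    for q in states:
        cumulative.append(cumulative[-1] + math.exp(-delta_u(q)))   # a_j
    r = rng.uniform(0.0, cumulative[-1])
    # Smallest j >= 1 with a_j >= r, i.e. r lies in (a_{j-1}, a_j].
    j = min(bisect.bisect_left(cumulative, r, 1), len(states))
    return states[j - 1]


# Usage with hypothetical energy increments 0, 1 and 2 for three states.
increments = {0: 0.0, 1: 1.0, 2: 2.0}
rng = random.Random(4)
draws = [gibbs_sample((0, 1, 2), increments.get, rng) for _ in range(30000)]
print([draws.count(q) / len(draws) for q in (0, 1, 2)])
```

The observed frequencies should be close to the normalized Boltzmann weights. Unlike the Metropolis and Heat Bath updates, every call evaluates the increment for all M states, which is the factor that dominates (19).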

If N is the size of the lattice, and we perform n iterations to compute our
estimate, the total cost will be:

$$C_T = N \cdot n \cdot \left(C(update) + C(select) + C(overhead)\right) \tag{20}$$

where C(select) is the cost associated with the selection of the next site whose
state is going to be updated. This selection involves the generation of 2 uniform
pseudo-random numbers in the first two cases, whereas for the Gibbs sampler it
requires only a couple of additions, since in this case we can select the next site
using a deterministic rule, such as lexicographic order (see section 6.3 below).


Since C(update) is the dominant cost, one might conclude that the
Metropolis algorithm is the most efficient. It must be considered, however, that
as the size of the state space (i.e., $M = |Q_i|$) increases, the number of iterations
needed to get an estimate with an equivalent degree of precision will increase much
faster in the Metropolis or Heat Bath cases than in the Gibbs sampler, since in the
latter case we are using an "importance sampling" procedure, versus the uniform
sampling of the former (see Hammersley and Handscomb, 1965).

A  rigorous  analysis  of  the  tradeoffs  involved  is  not  easy,  and  is  highly  dependent 
on  the  nature  of  the  particular  problem,  so  that  an  experimental  analysis  might  be 
needed  to  clarify  these  questions  in  each  case.  In  the  more  interesting  case  of  a 
parallel  implementation,  however,  the  Gibbs  sampler  becomes  the  obvious  choice. 
We  will  justify  this  assertion  in  the  following  sections. 

6.2.  Parallel  Updating. 

A necessary condition for the convergence of the probability measures of the
Markov chains defined by the Metropolis, Heat Bath or Gibbs Sampler algorithms
to the Gibbs measure is that if two sites belong to the same clique, they are never
updated at the same time. As we will show in the next section, this condition is
also sufficient only for the case of the Gibbs sampler. In this case it is possible to
update simultaneously the states of all non-neighboring sites, by implementing the
algorithm in a parallel architecture in which a processor is assigned to each site.
The total execution time will then be reduced by a factor of

$$\frac{N}{K}$$

where K is the so called "chromatic number" of the graph that describes the
neighborhood structure (it is equal to the minimum number of colors needed
to color the sites of the lattice in such a way that no two neighbors have the same color).
Note that if the state of every site is allowed to take real (continuous) values, we
may use a numerical simulation of the stochastic differential equation:

$$df = -\nabla U(f)\, dt + \sqrt{2T}\, dw$$

to generate sample configurations from the desired distribution (see section 3.3 of
chapter 2). In this case, all sites can be updated at the same time, so that a parallel
implementation can reduce the complexity by a factor of N.


6.2.1.  Convergence  of  the  Gibbs  Sampler. 

Geman and Geman (1984) established that the measure of the Markov chain
defined by the Gibbs sampler will converge to the Gibbs measure independently of
the initial state and of the order in which the sites are updated (provided
only that we keep visiting every site, i.e., that we update its state infinitely often).
The convergence of the parallel implementation, therefore, follows from this general
result, for which we present here a simple alternative proof:

First, we note that from the definition of a MRF, it follows that for every site
i, every value $q \in Q_i$, and every configuration f, the conditional probability

$$\Pr(f_i = q \mid f_j, j \neq i) > 0$$

Since  by  hypothesis  every  site  is  visited  infinitely  often,  this  implies  that  any 
two  states  of  the  chain  will  be  mutually  accessible  (with  positive  probability)  in  a 
finite  number  of  steps,  which  means  that  the  Gibbs  sampler  defines  a  regular  chain. 

On the other hand, the Gibbs measure $\pi(f)$ is an invariant probability vector
of the chain. To see this, suppose that at time t, just before updating site i, the
possible configurations of the field F(t) are distributed according to the Gibbs
measure:

$$\Pr(F(t) = f) = \pi(f)$$

After the update we have:

$$\Pr(F(t+1) = f) = \Pr(F_i(t+1) = f_i \mid F_j(t) = f_j, j \neq i) \cdot \Pr(F_j(t) = f_j, j \neq i) =$$

$$= \pi(f_i \mid f_j, j \neq i) \cdot \pi(f_j, j \neq i) = \pi(f)$$

because, by the definition of the algorithm, the new state of the ith element is
selected randomly according to the conditional Gibbs distribution. The proof is
now  completed  by  remembering  a  well  known  theorem  for  finite  Markov  chains 
(see  Kemeny  and  Snell,  1960)  that  establishes  that  every  regular  Markov  chain: 

(i)  Has  a  unique  invariant  probability  measure. 

(ii)  The  measure  of  the  chain  will  converge  (with  probability  1)  to  this  invariant 
measure  independently  of  the  initial  probability  distribution  of  the  states. 
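The invariance argument can be checked numerically on a field small enough to enumerate. The sketch below assumes a hypothetical chain of three binary sites with Ising potentials between adjacent sites, represents a probability distribution over configurations as a dictionary, and pushes it through a Gibbs update of one site at a time.

```python
import itertools
import math

SITES = 3
configs = list(itertools.product((0, 1), repeat=SITES))


def energy(f):
    """Ising energy of a one-dimensional chain of sites."""
    return sum(-1 if f[i] == f[i + 1] else 1 for i in range(SITES - 1))


z = sum(math.exp(-energy(f)) for f in configs)
pi = {f: math.exp(-energy(f)) / z for f in configs}      # Gibbs measure


def update_site(dist, i):
    """Distribution of the field after a Gibbs update of site i."""
    new = dict.fromkeys(configs, 0.0)
    for f, p in dist.items():
        variants = [f[:i] + (q,) + f[i + 1:] for q in (0, 1)]
        norm = sum(pi[v] for v in variants)
        for v in variants:
            new[v] += p * pi[v] / norm       # conditional Gibbs probability
    return new


def max_deviation(dist):
    """Largest absolute difference between `dist` and the Gibbs measure."""
    return max(abs(dist[f] - pi[f]) for f in configs)


# Starting from the Gibbs measure, an update of any site leaves it unchanged.
print([max_deviation(update_site(pi, i)) for i in range(SITES)])
```

Starting instead from a distribution concentrated on a single configuration and sweeping the sites in a fixed order, `max_deviation` shrinks toward zero, in agreement with the convergence property (ii) for a deterministic updating order.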

Note  that,  unlike  the  Metropolis  and  Heat  Bath  algorithms,  the  convergence  of 
the  Gibbs  sampler  does  not  depend  on  the  reversibility  of  the  chain  (or  equivalently, 
on  the  satisfaction  of  the  "detailed  balance"  condition  given  by  equation  (3)  of 
chapter  2),  although  this  condition  will  hold  if  we  use  it  with  a  random  updating 
order.  We  will  now  see  that  the  reversibility  will  not  hold  in  general  if  we  use  a 
parallel  updating  scheme,  which  will  make  the  first  two  algorithms  unsuitable  for 
parallel  implementations. 

6.2.2.  Breakdown  of  Reversibility  for  Parallel  Updating. 


To show why this condition is violated (by the three algorithms) when a parallel
updating scheme is used, we will consider a first order, binary MRF on a lattice L
with Ising potentials, that is,

$$f_i \in \{0, 1\} \quad \text{for all } i \in L$$

$$V_C(f_i, f_j) = \begin{cases} -1, & \text{if } |i-j| = 1 \text{ and } f_i = f_j \\ \phantom{-}1, & \text{if } |i-j| = 1 \text{ and } f_i \neq f_j \\ \phantom{-}0, & \text{otherwise} \end{cases}$$

To implement a parallel updating scheme, we divide the sites of the lattice into
two non-overlapping sets, which we will call B and W (the sets of "black" and
"white" sites, respectively) as illustrated in figure 7.

• o • o •

o • o • o

• o • o •

o • o • o

• o • o •

Figure 7. Non-overlapping sets for parallel updating (see text)

Let $f_W$, $f_B$ denote the state of the elements belonging to W and B, respectively,
so that $f = \{f_W, f_B\}$. The parallel updating scheme consists in updating first, say,
all the white sites, and then all the black ones. Note that the random variables
associated with any two sites of the same color are conditionally independent (given
the state of the elements of the other color), which means that the order in which
their state is updated is immaterial, so that, in fact, they can be updated
simultaneously.

Let $P_W$, $P_B$ denote the transition probabilities corresponding to an update
of all the white and black sites, respectively. Note that both Markov chains with
transition probabilities $P_W$ and $P_B$ satisfy the detailed balance condition (although
they are clearly not regular), so that for a fixed $f_B$, we have:

$$P_W(\{f_W, f_B\}, \{f'_W, f_B\}) = \frac{\pi(f'_W, f_B)}{\pi(f_W, f_B)}\, P_W(\{f'_W, f_B\}, \{f_W, f_B\})$$

and similarly, for a fixed $f_W$,

$$P_B(\{f_W, f_B\}, \{f_W, f'_B\}) = \frac{\pi(f_W, f'_B)}{\pi(f_W, f_B)}\, P_B(\{f_W, f'_B\}, \{f_W, f_B\})$$

where $\pi$ is the Gibbs measure of the complete configuration $f = \{f_W, f_B\}$.

Now, let $P_{WB}(f, f')$ be the transition probability associated with a complete
"white-black" update (where the white elements are updated first). We have:

$$\begin{aligned}
P_{WB}(f, f') &= P_W(\{f_W, f_B\}, \{f'_W, f_B\})\, P_B(\{f'_W, f_B\}, \{f'_W, f'_B\}) \\
&= P_W(\{f'_W, f_B\}, \{f_W, f_B\})\, \frac{\pi(f'_W, f_B)}{\pi(f_W, f_B)} \cdot P_B(\{f'_W, f'_B\}, \{f'_W, f_B\})\, \frac{\pi(f'_W, f'_B)}{\pi(f'_W, f_B)} \\
&= \frac{\pi(f')}{\pi(f)}\, P_{BW}(f', f)
\end{aligned}$$

where $P_{BW}$ is the transition probability of the converse "black-white" update (black
sites visited first).

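Now, consider the particular configuration:

$$f_i = \begin{cases} 0, & i \in W \\ 1, & i \in B \end{cases}$$

and let

$$f'_i = 1 \quad \text{for all } i \in L$$

Clearly,

$$P_{BW}(f', f) > P_{WB}(f', f)$$

and so,

$$\pi(f)\, P_{WB}(f, f') > \pi(f')\, P_{WB}(f', f)$$

so that the detailed balance condition does not hold.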
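The argument above can be checked numerically on a very small system. The following is a minimal sketch, not part of the original development: it assumes a 4-site one-dimensional chain with free boundaries (sites 0 and 2 "white", sites 1 and 3 "black"), the Ising potentials defined above, and an arbitrary temperature T = 1. The names `sweep`, `compose`, `P_WB` and `P_BW` are illustrative only.

```python
import itertools
import math

N = 4                     # sites on a 1-D chain with free boundaries (assumption)
T = 1.0                   # temperature; an arbitrary value for the demonstration
WHITE, BLACK = (0, 2), (1, 3)
STATES = list(itertools.product((0, 1), repeat=N))


def energy(f):
    """Ising energy: -1 per equal neighbour pair, +1 per unequal pair."""
    return sum(-1 if f[i] == f[i + 1] else 1 for i in range(N - 1))


Z = sum(math.exp(-energy(f) / T) for f in STATES)
PI = {f: math.exp(-energy(f) / T) / Z for f in STATES}   # Gibbs measure


def conditional(f, i, value):
    """Gibbs probability that site i takes `value` given the other sites of f."""
    weights = []
    for v in (0, 1):
        g = f[:i] + (v,) + f[i + 1:]
        weights.append(math.exp(-energy(g) / T))
    return weights[value] / sum(weights)


def sweep(f, f_new, sites):
    """Probability of f -> f_new when every site in `sites` is updated at once."""
    if any(f[j] != f_new[j] for j in range(N) if j not in sites):
        return 0.0
    return math.prod(conditional(f, i, f_new[i]) for i in sites)


def compose(first, second):
    """Transition function of a full update: `first` colour, then `second`."""
    return lambda f, f_new: sum(
        sweep(f, g, first) * sweep(g, f_new, second) for g in STATES)


P_WB = compose(WHITE, BLACK)
P_BW = compose(BLACK, WHITE)

f = (0, 1, 0, 1)          # white sites 0, black sites 1
f_prime = (1, 1, 1, 1)    # all sites 1

# pi(f) P_WB(f, f') equals pi(f') P_BW(f', f) but exceeds pi(f') P_WB(f', f).
print(PI[f] * P_WB(f, f_prime),
      PI[f_prime] * P_BW(f_prime, f),
      PI[f_prime] * P_WB(f_prime, f))
```

In this sketch the Gibbs measure is still invariant under the white-black update, while the detailed balance condition fails for the pair (f, f'), which is the situation described in the text.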

The  above  argument  can  be  easily  generalized  to  show  that  if  we  use  any 
prescribed  updating  order  (such  as  lexicographic  order),  the  Markov  chain  generated 
by  any  of  the  three  algorithms  will  also  become  irreversible.  These  chains,  however, 
will  remain  regular,  which  means  that  in  each  case,  the  probability  distribution  of 
the  configurations  generated  by  the  chain  will  converge  towards  a  unique  invariant 
distribution.  In  general,  however,  it  will  not  be  possible  to  guarantee  the  coincidence 
of  this  invariant  measure  with  the  desired  Gibbs  distribution,  except  in  the  case  of 
the  Gibbs  sampler. 


An example of a situation in which the invariant distribution is not the
Gibbsian measure can be obtained by running the Metropolis algorithm, either
with lexicographic or "black-white" updating order, for the Ising model discussed in
section  2  of  chapter  2.  If  the  natural  temperature  is  below  the  critical  temperature 
of  the  infinite  lattice,  the  algorithm  will  produce  equilibrium  configurations  that 
are  almost  completely  uniform,  and  therefore,  inconsistent  with  the  theoretical 
predictions  (and  with  the  behavior  of  the  same  algorithm  when  random  updating 
order  is  used).  The  Gibbs  Sampler  (which  in  this  case  is  equivalent  to  the  Heat 
Bath  scheme),  on  the  other  hand,  produces  consistent  results,  as  expected. 

6.3.  Discussion. 

The  previous  results  mean  that  the  expected  computational  cost  (execution 
time)  for  the  solution  of  a  reconstruction  problem  on  a  large  parallel  machine,  using 
our  general  Monte  Carlo  procedure,  will  be  given  by: 

$$C_P = n \cdot K \cdot C_{GS} \qquad (21)$$

where n is the number of (global) iterations; K is the chromatic number of the
graph of the underlying Markov model, and $C_{GS}$ is the updating cost of the Gibbs
Sampler, given by equation (19).

An  example  of  such  a  massively  parallel  architecture  is  the  "Connection 
Machine"  (Hillis,  1985).  This  machine  was  originally  designed  for  the  parallel 
processing  of  structured  symbolic  expressions,  such  as  frames  and  semantic  networks. 
It  is  a  "Single  Instruction  Multiple  Data"  (SIMD)  array  processor  consisting  of 
256,000  processing  units  (each  with  a  single  bit  Arithmetic/Logical  unit,  and  about 
4K  bits  of  storage)  organized  in  a  four-connected  lattice  that  is  512  elements 
square.  Besides  this  nearest-neighbor  connectivity,  it  will  also  be  possible  (although 
computationally  more  expensive),  to  connect  any  two  processors  in  the  array  using 
a  "Cross  Omega"  router  network  (Knight,  in  Winston,  1984). 

At each cycle of the machine, for which we will assume a duration of
one microsecond, an instruction is executed by each processor, and a single
bit is transmitted to its neighbors. This means that the updating scheme can
be implemented most efficiently if the field is first order Markov, but higher
order processes can also be implemented without using the router by successively
propagating the transmitted state (the execution time, therefore, will grow linearly
with the order of the field).

To  make  these  results  more  concrete,  consider,  as  an  example,  the  problem  of 
finding  the  optimal  estimate  for  an  M-ary,  first  order  MRF  with  Ising  potentials 
(i.e.,  the  segmentation  of  a  piecewise  constant  image)  from  noisy  observations  (we 
will  analyze  this  problem  in  detail  in  the  next  chapter).  Let  us  assume  that  the 
estimator  is  to  be  implemented  in  the  "Connection  Machine",  and  suppose  that  by 
the  use  of  appropriate  scaling  factors,  all  the  numbers  can  be  represented  as  16-bit 
integers. We will use the following conservative assumptions: 16 cycles of a
single 1-bit processor are needed to perform a 16-bit addition, subtraction
or comparison; $16^2$ cycles to perform a multiplication or division; $2 \times 16^2$ cycles
to generate a pseudo-random number with uniform distribution on a given interval;
16 cycles for memory transfer operations, and $6 \times 16^2$ cycles to compute an
exponential.
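These per-operation costs are easy to tabulate. The sketch below is only an illustration of the arithmetic; the dictionary `CYCLES` and the helpers `cycles` and `seconds` are hypothetical names, and the one-microsecond cycle time is the value assumed in the text.

```python
WORD = 16  # bits per operand, as assumed in the text

# Cycles of a single 1-bit processor for each 16-bit operation.
CYCLES = {
    "add": WORD, "sub": WORD, "cmp": WORD,   # 16 cycles each
    "mul": WORD ** 2, "div": WORD ** 2,      # 16^2 = 256 cycles
    "rand": 2 * WORD ** 2,                   # uniform pseudo-random number
    "mem": WORD,                             # memory transfer
    "exp": 6 * WORD ** 2,                    # exponential
}


def cycles(**counts):
    """Total cycles for a mix of operations, e.g. cycles(add=10, mul=2)."""
    return sum(CYCLES[op] * n for op, n in counts.items())


def seconds(total_cycles, cycle_time=1e-6):
    """Execution time for a given number of cycles of one-microsecond duration."""
    return total_cycles * cycle_time


print(cycles(add=10, mul=2))   # ten additions and two multiplications
```

The same accounting reproduces the figure of 672 cycles quoted in section 5.2.1 of chapter 4 for ten additions and two multiplications.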

Assuming  that  we  run  250  iterations  of  the  system,  and  ignoring  the  overhead 
time  we  get,  from  (19)  and  (21), 

$$C_P \approx 1.4\,(M - 1) \text{ seconds} \qquad (22)$$

Although  this  execution  time  may  be  reasonable  in  many  cases,  it  is  clear  that 
this  approach  becomes  impractical  as  M  becomes  large.  In  this  case,  it  might  be 
more  convenient  to  approximate  the  field  by  one  in  which  the  state  at  each  site  takes 
continuous values in a compact set and, provided that $U_P$ satisfies the appropriate
smoothness conditions, use the stochastic differential equation:

$$df = -\nabla U_P \, dt + \sqrt{2T}\, dw \qquad (23)$$

where  w  is  a  Wiener  process,  to  simulate  the  behavior  of  the  system  (see  chapter  2, 
section  2.2). 
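A diffusion of this kind is usually simulated with a simple time discretization. The following is a minimal sketch using the Euler-Maruyama scheme, under the assumption that the noise amplitude is $\sqrt{2T}$ (the standard Langevin normalization, which may differ from the one used in chapter 2). The function `langevin`, the step size and the quadratic test energy are illustrative choices, not taken from the text.

```python
import math
import random


def langevin(grad_u, x0, temperature, dt, steps, rng):
    """Euler-Maruyama discretization of dx = -grad U(x) dt + sqrt(2 T) dw."""
    x = list(x0)
    path = []
    scale = math.sqrt(2.0 * temperature * dt)   # standard deviation of each noise step
    for _ in range(steps):
        g = grad_u(x)
        x = [xi - gi * dt + scale * rng.gauss(0.0, 1.0) for xi, gi in zip(x, g)]
        path.append(list(x))
    return path


# Test energy U(x) = x^2 / 2, whose Gibbs distribution is Gaussian with variance T.
rng = random.Random(0)
T = 0.5
path = langevin(lambda x: x, [0.0], T, dt=0.01, steps=200_000, rng=rng)
samples = [p[0] for p in path[20_000:]]          # discard an initial transient
variance = sum(s * s for s in samples) / len(samples)
print(variance)                                   # should be close to T
```

For the quadratic energy the sample variance approaches T, which is a quick sanity check that the discretized process equilibrates to the intended Gibbs distribution.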

This scheme will not work, however, if some of the variables are intrinsically
discrete (e.g., binary variables indicating the presence or absence of a boundary). In
this case, it might still be possible to use a mixed scheme in which the state of the
discrete variables is updated using the Gibbs Sampler, and that of the continuous
ones using equation (23), but the precise form of such mixed schemes has not been
determined, nor their convergence properties established.

These  considerations  provide  us  with  a  strong  motivation  for  finding  alternative 
ways  of  solving  these  problems.  In  particular,  much  more  research  is  needed  in  the 
following  directions: 

(i) Design of more efficient (possibly deterministic) algorithms for approximating
the optimal estimators for particular classes of problems.

(ii)  Design  of  analog  and  hybrid  networks  for  implementing  these  kinds  of 
algorithms. 

We  will  study  these  possibilities  in  detail,  in  the  context  of  specific  problems 
in  the  following  chapters. 


Chapter  4 


RECONSTRUCTION  OF  PIECEWISE  CONSTANT  FUNCTIONS 


1. Introduction.


In  this  chapter  we  will  apply  the  optimal  Bayesian  estimators  that  we  have 
developed,  to  the  problem  of  reconstructing  piecewise  constant  functions  from  noisy 
observations.  The  efficient  solution  of  this  problem  is  relevant  for  several  reasons: 

(i)  Binary  images  (or  images  consisting  of  only  a  few  grey  levels)  are  directly 
useful  in  many  interesting  applications  (for  example,  object  recognition 
and  manipulation  in  restricted  (industrial)  environments). 

(ii) Several perceptual problems, such as the segmentation of textured images
(Elliot et al. (1983); Hansen and Elliot (1982); Cohen and Cooper (1984)),
or the formation of perceptual clusters (O'Callahan (1974); Marroquin
(1976)), can be reduced to the problem of reconstructing a piecewise
constant surface.

(iii)  As  we  will  see  in  the  next  chapter,  where  we  treat  the  reconstruction  of 
piecewise  smooth  surfaces,  the  boundaries  between  continuous  patches  can 
be  adequately  modeled  by  binary  fields  coupled  with  continuous  valued 
processes.  These  coupled  systems  are  very  difficult  to  analyze  in  a  rigorous 
way.  We  hope  to  increase  our  understanding  of  them  by  studying  first  the 
estimation  of  binary  fields. 


2.  Problem  Formulation. 


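Following Geman and Geman (1984), we will model the behavior of piecewise
constant functions using first order MRF models on a finite lattice with generalized
Ising potentials:

$$V_C(f_i, f_j) = \begin{cases} -1, & \text{if } |i-j| = 1 \text{ and } f_i = f_j \\ \phantom{-}1, & \text{if } |i-j| = 1 \text{ and } f_i \neq f_j \\ \phantom{-}0, & \text{otherwise} \end{cases} \qquad (1)$$

$$f_i \in Q_i = \{q_1, \ldots, q_M\} \quad \text{for all } i$$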

We  will  use  a  free  boundary  model,  so  that  the  neighborhood  size  for  a  given 
site  will  be:  4,  if  it  is  in  the  interior  of  the  lattice;  3,  if  it  lies  at  a  boundary,  but  not 
at  a  corner,  and  2  for  the  corners. 

The Gibbs distribution:

$$P_f(f) = \frac{1}{Z} \exp\left[-\frac{1}{T_0} U_0(f)\right]$$

$$U_0(f) = \sum_{i,j} V(f_i, f_j) \qquad (2)$$

defines a one parameter family of models (indexed by $T_0$) describing piecewise
constant patterns with varying degrees of granularity.

Using the general stochastic model for the observation process presented in
section 2.1 of chapter 3, we get the posterior distribution given by equation (6) of
that chapter:

$$P_{f|g}(f; g) = \frac{1}{Z_P} \exp[-U_P(f; g)]$$

with

$$U_P(f; g) = \frac{1}{T_0} U_0(f) + \sum_{i \in S} \Phi(f_i, g_i) \qquad (3)$$


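Of particular interest will be the case of binary fields (M = 2) with the observations
taken as the output of a binary symmetric channel (BSC) with error rate $\epsilon$ (Gallager,
1975), so that:

$$P(g_i \mid f_i) = \begin{cases} 1 - \epsilon, & \text{for } g_i = f_i \\ \epsilon, & \text{for } g_i \neq f_i \end{cases}$$

In this case, the posterior energy reduces to:

$$U_P(f; g) = \frac{1}{T_0} \sum_{i,j} V(f_i, f_j) + \alpha \sum_i \bigl(1 - \delta(f_i - g_i)\bigr) \qquad (4)$$

where $f_i \in \{q_1, q_2\}$;

$$\delta(a) = \begin{cases} 1, & \text{if } a = 0 \\ 0, & \text{otherwise} \end{cases}$$

$$\alpha = \ln\left(\frac{1 - \epsilon}{\epsilon}\right)$$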


Note that in this case (and also in the case of additive white Gaussian noise), by
modifying the constant $Z_P$, and applying a suitable linear transformation to the
variables $\{f_i\}$, so that $Q_i = \{-1, 1\}$, we can write the posterior energy in the form:

$$U_P(f; g) = -\frac{1}{T_0} \sum_{i,j} f_i f_j - \frac{\alpha}{2} \sum_i f_i g_i \qquad (5)$$

which corresponds to the Hamiltonian of an Ising ferromagnet coupled with a
spatially varying external magnetic field (whose magnitude is proportional to g).
The importance of this connection is twofold: on the one hand, it means that the
tools developed for the equilibrium behavior of these systems (which is what
the estimation process is about) may be relevant for the physicists. On the other
hand, it is conceivable that one could use physical ferromagnets to construct special
purpose "quantum" computers that could solve estimation problems at atomic
speeds.
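The equivalence between the binary form of the posterior energy and the spin (ferromagnet) form can be verified by enumeration. The sketch below is an illustration only: it assumes a 4-site chain, arbitrary values $T_0 = 1.5$ and $\epsilon = 0.2$, and the mapping $s = 2f - 1$ applied to both the field and the observations. With these assumptions the two energies differ by the constant $\alpha N / 2$, which does not affect the estimates.

```python
import itertools
import math

BONDS = [(0, 1), (1, 2), (2, 3)]      # a 4-site chain, chosen for illustration
T0, EPS = 1.5, 0.2                    # arbitrary demonstration values
ALPHA = math.log((1 - EPS) / EPS)


def energy_binary(f, g):
    """Posterior energy for f, g in {0, 1}: Ising prior plus BSC data term."""
    prior = sum(-1 if f[i] == f[j] else 1 for i, j in BONDS)
    return prior / T0 + ALPHA * sum(a != b for a, b in zip(f, g))


def energy_spin(s, t):
    """Ferromagnet form for s, t in {-1, 1}: coupling plus external field."""
    coupling = -sum(s[i] * s[j] for i, j in BONDS) / T0
    return coupling - 0.5 * ALPHA * sum(a * b for a, b in zip(s, t))


def to_spin(x):
    """Linear map {0, 1} -> {-1, 1}."""
    return tuple(2 * v - 1 for v in x)


configs = list(itertools.product((0, 1), repeat=4))
diffs = [energy_binary(f, g) - energy_spin(to_spin(f), to_spin(g))
         for f in configs for g in configs]
print(min(diffs), max(diffs), ALPHA * 4 / 2)   # all three agree up to rounding
```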

In  the  following  sections,  we  will  study  the  relative  performance  of  different 
Bayesian  estimators,  and  design  efficient  algorithms  for  approximating  them  in  some 
important  particular  cases. 

3.  Relative  Performance  of  Bayesian  Estimators  for  Binary  Fields. 


Once  the  posterior  energy  has  been  determined,  one  can  solve  the  reconstruction 
problem  by  finding  the  optimal  Bayesian  estimate  of  the  field  /.  As  we  discussed 
in  chapter  3,  however,  we  have  several  possible  choices  for  the  optimality  criterion. 
To  understand  the  differences  in  their  performance,  we  will  now  analyze  in  detail 
the estimation of binary fields, when the observations are the output of a BSC with
error rate $\epsilon$.

Since the field is binary, the MPM and TPM estimators (defined by equations
(13) and (14) of chapter 3, respectively) coincide. The question is: how do the
performances of the MAP and, say, TPM estimates compare with respect to the
error criterion:

$$\bar{e}_s = E[e_s(f, \hat{f})]$$

with

$$e_s(f, \hat{f}) = \sum_{i=1}^{N} \bigl(1 - \delta(f_i - \hat{f}_i)\bigr)$$

where N is the size of the lattice, and the expectation is taken over all possible
configurations f and g.

In particular we are interested in the ratio:

$$r = \frac{\bar{e}_{MAP}}{\bar{e}_{TPM}} = \frac{\sum_{f,g} \exp[-U_P(f; g)]\, e_s(f, \hat{f}_{MAP}(g))}{\sum_{f,g} \exp[-U_P(f; g)]\, e_s(f, \hat{f}_{TPM}(g))}$$

The  numerical  evaluation  of  this  expression  is  feasible  only  for  small  values  of  N. 

In figure 8 we show a plot of the ratio r for a 2 × 2 lattice, for different values
of the error rate $\epsilon$ and the natural temperature $T_0$. As expected, r is never less than
1. In the worst case (for $\epsilon = 0.1$ and $T_0 = 0.2$) the error of the MAP estimate
is 1.17 times that of the MPM estimate; if $T_0$ is not too small and $\epsilon$ is not too
large, both estimates coincide, and as $\epsilon$ approaches 0.5 (low signal to noise ratio),
the MPM estimate is consistently better than the MAP. An experimental analysis
of larger lattices reveals a similar qualitative behavior, but the values of r are much
larger in this case (see table 1).
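For a 2 × 2 lattice the ratio r can be evaluated by direct enumeration of the 16 × 16 pairs (f, g). The following sketch assumes the binary posterior energy with the BSC observation model described above and free boundaries; the function names are illustrative. The joint probability of (f, g) is proportional to $\exp[-U_P(f; g)]$, so the common normalizing constant cancels in the ratio.

```python
import itertools
import math

BONDS = [(0, 1), (2, 3), (0, 2), (1, 3)]          # nearest-neighbour pairs, 2 x 2 lattice
CONFIGS = list(itertools.product((0, 1), repeat=4))


def posterior_energy(f, g, t0, eps):
    """Ising prior energy over T0 plus alpha times the number of disagreements."""
    alpha = math.log((1 - eps) / eps)
    prior = sum(-1 if f[i] == f[j] else 1 for i, j in BONDS)
    return prior / t0 + alpha * sum(a != b for a, b in zip(f, g))


def seg_error(f, estimate):
    """Number of sites at which the estimate differs from the true field."""
    return sum(a != b for a, b in zip(f, estimate))


def error_ratio(t0, eps):
    """Expected segmentation error of the MAP estimate over that of the MPM estimate."""
    err_map = err_mpm = 0.0
    for g in CONFIGS:
        weight = {f: math.exp(-posterior_energy(f, g, t0, eps)) for f in CONFIGS}
        total = sum(weight.values())
        f_map = max(CONFIGS, key=weight.get)       # most probable configuration
        f_mpm = tuple(                             # site-wise most probable value
            int(sum(w for f, w in weight.items() if f[i] == 1) > 0.5 * total)
            for i in range(4))
        err_map += sum(w * seg_error(f, f_map) for f, w in weight.items())
        err_mpm += sum(w * seg_error(f, f_mpm) for f, w in weight.items())
    return err_map / err_mpm


for t0 in (0.2, 1.0, 2.0):
    print(t0, [round(error_ratio(t0, eps), 3) for eps in (0.1, 0.3, 0.45)])
```

Because the MPM estimate minimizes the expected number of misclassified sites for every g, the ratio computed this way cannot fall below 1.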

3.1.  Example. 

We  now  return  to  the  example  presented  in  figure  6  of  chapter  3,  and  examine 
it  in  more  detail.  Panel  (a)  represents  a  typical  realization  of  a  64  x  64  Ising  net 
with free boundaries, using a value of $T_0 = 1.74$ (0.75 times the critical temperature

Table 1.

              f          g          f_MAP      f_MPM (M.C.)   f_MPM (det.)
Energy        -5594.8    -226.0     -6660.9    -6460.0        -6427.0
Seg. Error    -          0.4        0.33       0.128          0.124

4.  Exact  Algorithms  for  the  MAP  Estimator. 

From the discussion of the previous section, it is clear that if the signal to
noise ratio is not too low, the MAP criterion may be an appropriate choice, if
one can design efficient algorithms for computing it. As we will now show, in
the case of one-dimensional binary fields, one can in fact construct an algorithm
which computes the MAP estimate exactly with computational complexity O(N)
on a serial machine, where N is the length of the lattice: at most 22N operations
are needed, and the storage requirements are also O(N). The algorithm can also be
distributed in a parallel architecture, making its execution time independent of the
lattice length.

To simplify the notation, we will assume that $f_i \in \{-1, 1\}$ for all i (there is no
loss of generality in this assumption, since any binary process can be brought into
this form by a reversible linear transformation). Also, assuming the noise process is
stationary, we introduce the notation:

$$\Phi_{f_i}(g_i) = \frac{T_0}{2}\, \Phi(f_i, g_i)$$

where $T_0$ is the natural temperature of the field.

From equations (1) and (3), it is clear that the MAP estimation problem is
equivalent to the minimization of:

$$U(f) = n + \sum_i \Phi_{f_i}(g_i) \qquad (7)$$

where n is the number of places where $f_i \neq f_{i+1}$ (the number of odd bonds of the
configuration). From this expression, it follows that the MAP estimation process can
be reduced to the problem of finding the optimal value for n, and the best locations
for the odd bonds (which we will also call "boundaries" between constant-valued
blocks). We will now present a procedure for performing this task.

Description  of  the  Algorithm. 

The idea on which this method is based is the following:

We start scanning the sequence $\{g_i\}$, say, from the left, with some initial
estimate $k \in \{-1, 1\}$ for the value of f in the block that starts at $l_0$ (a pointer that
is initially set to 1).

Whenever we process a new observation $g_j$, we ask if we can get a lower energy
by putting a boundary at j and at the best possible location l within the interval
$[l_0, j]$, that is, we ask if:

$$U_b + 1 < U_p$$

where

$$U_p = \sum_{i=l_0}^{j} \Phi_k(g_i)$$

$$U_b = 1 + \sum_{i=l_0}^{l} \Phi_k(g_i) + \sum_{i=l+1}^{j} \Phi_{-k}(g_i)$$

As we will see below, the optimal boundary location l (which is initially set equal
to $l_0$) needs to be updated only if the conditions:

$$f_j^{ML} \neq k$$

$$f_j^{ML} \neq f_{j-1}^{ML}$$

$$\sum_{i=l_0}^{j-1} \bigl[\Phi_{+k}(g_i) - \Phi_{-k}(g_i)\bigr] < \sum_{i=l_0}^{l} \bigl[\Phi_{+k}(g_i) - \Phi_{-k}(g_i)\bigr]$$

hold simultaneously, in which case l is set equal to j - 1. Here, $f^{ML}$ denotes the
maximum likelihood estimate: since we are using a white noise model, it is given
by:

$$f_j^{ML} = \begin{cases} \phantom{-}1, & \text{if } \Phi_{+1}(g_j) < \Phi_{-1}(g_j) \\ -1, & \text{otherwise} \end{cases}$$

If we get a lower energy by putting a boundary at l, we set $f_i = k$ for $i \in [l_0, l]$;
update the value of the pointer $l_0$ by setting it equal to l + 1, and set the new
estimate for the value of $f_i$ in the block that starts at $l_0$ equal to -k.

Otherwise, we just set $f_j = k$, and continue to process the next observation.

When we reach $g_N$, we take $f_N$ as the initial estimate and run the same
process backwards to get the final solution (in fact, one can show that it is possible
to make this backward run as soon as we get the second boundary). This means
that we can implement the algorithm in a distributed fashion, by processing in
parallel overlapping subsequences of $\{g_i\}$, provided that the length of each of these
subsequences is greater than twice the length of the largest constant-valued block
in f. The final solution is then obtained by pasting together these partial estimates.

Formally,  the  algorithm  is  as  follows: 


Definition of Variables.

$i$: Current position.

$l_0$: Pointer to the beginning of the current region.

$l$: Current optimal location of the boundary in the interval $[l_0, i]$.

$k$: Current estimate for $f([l_0, l])$.

$U_p$: Energy increment associated with the assignment $f([l_0, i]) = k$.

$U_m$: Energy increment associated with the assignment $f([l_0, i]) = -k$.

$U_b$: Energy increment associated with the assignment $f([l_0, l]) = k$; $f((l, i]) = -k$.

$s_i$: Best local (maximum likelihood) estimate for $f_i$.

$s_{im1}$: Best local (maximum likelihood) estimate for $f_{i-1}$.

$U_{pl}$: Energy increment associated with the assignment $f([l_0, l]) = k$.

$U_{ml}$: Energy increment associated with the assignment $f([l_0, l]) = -k$.

$U_{temp}$: Temporary storage register.

$M$: A very large positive number.

$K_0$: Switch indicating the method for estimating $f_1$.

Algorithm A1($K_0$):

1: Initialization.

   Set $l_0 = l = 1$; $U_p = U_m = U_{ml} = 0$; $U_b = 1$; $U_{pl} = M$.

   Set $k$ =  1,     if $K_0 = 0$ and $\Phi_{+1}(g_1) < \Phi_{-1}(g_1)$;
             -1,     if $K_0 = 0$ and $\Phi_{+1}(g_1) > \Phi_{-1}(g_1)$;
             $K_0$,  if $K_0 \neq 0$.

   Set $s_{im1} = k$.

2: Main Loop: For $i$ from 1 to $N$ do:

   Begin

   Set $s_i$ =  1,  if $\Phi_{+1}(g_i) < \Phi_{-1}(g_i)$;
               -1,  otherwise.

   2.1: See if the optimal boundary location needs to be updated:

        If ($s_i \neq k$ and $s_i \neq s_{im1}$ and $U_p - U_{pl} - U_m + U_{ml} < 0$) do

           Update boundary location:

           Set:
              $l = i - 1$
              $U_{pl} = U_p$
              $U_{ml} = U_m$
              $U_b = U_p + 1$

   2.2: Update energy increments:

        Set:
           $U_p = U_p + \Phi_{+k}(g_i)$
           $U_m = U_m + \Phi_{-k}(g_i)$
           $U_b = U_b + \Phi_{-k}(g_i)$

   2.3: See if a new boundary has to be introduced:

        If ($U_b + 1 < U_p$) do:

           Introduce a new boundary:

           For $j$ from $l_0$ to $l$ do: Set $f_j = k$

           Set:
              $k = -k$
              $l_0 = l + 1$
              $U_{temp} = U_p - U_{pl}$
              $U_p = U_m - U_{ml}$
              $U_m = U_{temp}$
              $U_{pl} = M$
              $U_b = U_m + 1$

   2.4: Set $s_{im1} = s_i$

   End

3: See if the last boundary has to be introduced:

   If ($U_b < U_p$) do:

   3.1: For $j$ from $l_0$ to $l$ set $f_j = k$.

   3.2: Set $l_0 = l + 1$.

   3.3: Set $k = -k$.

4: Fill the last region:

   For $j$ from $l_0$ to $N$ set $f_j = k$.

End.

The proof of the fact that this algorithm will in fact find the global minimizer of
(7) is presented in appendix 4.A.


In  appendix  4.B  we  present  an  alternative  approach  to  this  minimization,  which 
is  based  on  dynamic  programming  ideas.  The  resulting  algorithm  is  less  efficient  than 
the  one  we  have  just  presented  for  the  case  of  binary  fields,  but  it  has  the  advantage 
of  being  extensible  to  handle  more  general  situations.  Also  in  this  appendix,  we 
compute  the  probability  distribution  for  the  number  of  odd  bonds,  and  discuss  the 
relationship  between  the  dynamic  programming  procedure,  and  the  use  of  linear 
filters  to  produce  multi-scale  descriptions  of  piecewise  constant  signals. 
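To make the dynamic programming idea concrete, the following is a minimal sketch of a generic Viterbi-style recursion for a binary chain. It is not the algorithm of appendix 4.B nor Algorithm A1; it assumes that the energy to be minimized is the number of odd bonds plus the sum of the local data costs, and that `phi[i][0]` and `phi[i][1]` (hypothetical names) hold the data cost of assigning -1 and +1 to site i. The random costs are for demonstration only.

```python
import itertools
import random


def chain_energy(f, phi):
    """Number of odd bonds plus the local data costs of configuration f."""
    odd_bonds = sum(a != b for a, b in zip(f, f[1:]))
    return odd_bonds + sum(phi[i][(s + 1) // 2] for i, s in enumerate(f))


def map_chain(phi):
    """Exact minimizer of chain_energy by forward recursion and backtracking."""
    cost = list(phi[0])                 # best energy of a prefix ending in each state
    back = []
    for row in phi[1:]:
        new_cost, pointer = [], []
        for s in (0, 1):
            stay, switch = cost[s], cost[1 - s] + 1.0   # a boundary costs 1
            pointer.append(s if stay <= switch else 1 - s)
            new_cost.append(min(stay, switch) + row[s])
        cost = new_cost
        back.append(pointer)
    s = 0 if cost[0] <= cost[1] else 1
    path = [s]
    for pointer in reversed(back):      # follow the stored predecessors
        s = pointer[s]
        path.append(s)
    path.reverse()
    return [2 * s - 1 for s in path]    # map state index to {-1, +1}


rng = random.Random(1)
phi = [[2 * rng.random(), 2 * rng.random()] for _ in range(8)]
estimate = map_chain(phi)
brute = min(chain_energy(f, phi) for f in itertools.product((-1, 1), repeat=8))
print(estimate, chain_energy(estimate, phi), brute)
```

The recursion uses O(N) time and storage, and it extends directly to more than two states per site at the cost of a larger inner loop.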

5.  Estimation  of  Two-Dimensional  Binary  Fields. 

The techniques developed in the last section for the exact computation of the
MAP  estimate  cannot  be  extended  to  the  two-dimensional  case;  the  main  difficulty 
here  is  that  the  geometry  of  the  boundaries  between  uniform  regions  (which  in  the 
one  dimensional  case  are  simply  points),  causes  a  combinatorial  explosion  of  the 
number  of  possible  configurations  compatible  with  a  given  total  boundary  length. 
The  question,  then,  is  whether  it  is  possible  to  find  algorithms  that  approximate 
the  optimal  estimates  (with  respect  to  the  selected  error  criterion),  that  are  more 
efficient  than  the  general  Monte  Carlo  procedures  presented  in  chapter  3. 

5.1.  MAP  Estimator. 

In  the  case  of  the  MAP  estimator,  the  efficiency  of  the  Simulated  Annealing 
algorithm  for  the  minimization  of  Up  can  be  improved  by  defining  large  "blocks" 
of  sites  (in  a  manner  that  is  reminiscent  of  the  "block-spin"  strategy  used  by 
Wilson  (1975)  in  connection  with  the  renormalization  group  approach  to  the  study 
of  critical  phenomena);  the  optimal  estimate  for  the  average  value  of  the  field  in 
each  of  these  blocks  is  found,  and  then  progressively  refined  by  subdividing  the 
blocks in successive annealing stages. We will now show that, if we use a maximum
entropy assumption, the structure of the MAP estimation process for Ising models
is invariant under the "blocking" transformation: this means that the ground state
(i.e., the MAP estimator) of the aggregated process (with blocks of size L) also
corresponds to that of an Ising model with a coupled external field, in which the
natural temperature is scaled by a factor of 1/L, and the noise (coupling) parameter
by a factor of $L^2$. As a consequence of this scaling, the final temperature for the
simulated annealing of this smaller network will be approximately L times larger
than for the original problem.

Let us consider a binary Ising net f with the observations taken as the output
of a binary symmetric channel with error rate $\epsilon$. From section 2, we know that the
posterior energy will be:

$$U_P = \frac{1}{T_0} \sum_{i,j} V(f_i, f_j) + \alpha \sum_i q(f_i, g_i) \qquad (8)$$

$$q(f_i, g_i) = \begin{cases} 0, & \text{if } g_i = f_i \\ 1, & \text{if } g_i \neq f_i \end{cases}$$

Notice that equation (8) can also be written in the form:

$$U_P = \frac{1}{T_0} \sum_{i,j} V_c(f_i, f_j) + \alpha \sum_i q_c(f_i, g_i) \qquad (8')$$

where $V_c$, $q_c$ are continuous functions satisfying:

$$V_c(x, y) = V(x, y) \quad \text{and} \quad q_c(x, y) = q(x, y) \quad \text{for } x, y \in \{0, 1\}$$
We will now derive an expression for the energy in the "block spin" case. Let us
partition the original lattice into square blocks of side L. The "block observations"
$g_L$ will now be the density of 1's on each block, i.e.,

$$g_L(i) = \frac{1}{L^2} \sum_{j \in D_i} g_j$$

where $D_i$ is the ith block. The "block field" $f_L$ is defined in a similar way.

For a given $f_L$, we compute the energy by assuming a maximum entropy
configuration, which occurs when the 1's that correspond to the given density $f_L(i)$
are randomly distributed within the block. The energy will have three terms:

1. Interactions between adjacent blocks:

The interaction between two adjacent blocks i and j will be:

$$U_{ij} = \bigl[-1 \cdot (P_{11} + P_{00}) + 1 \cdot (P_{10} + P_{01})\bigr] \cdot L$$

where $P_{kl}$ is the probability of having an element with state k on block i adjacent
to an element with state l on block j:

$$P_{11} = f_L(i)\, f_L(j)$$

$$P_{01} = f_L(j)\,(1 - f_L(i))$$

$$P_{10} = f_L(i)\,(1 - f_L(j))$$

$$P_{00} = (1 - f_L(i))(1 - f_L(j))$$

Substituting these values we get:

$$U_{ij} = L\bigl[2(f_L(i) + f_L(j)) - 4 f_L(i) f_L(j) - 1\bigr]$$

2. Interactions within each block:

This term depends on the relative frequencies of the clique configurations
11, 10, 01 and 00 ($p_{11}$, $p_{10}$, $p_{01}$ and $p_{00}$, respectively) on each block (note that there
are 2L(L - 1) different cliques). Since the 1's are randomly distributed we get:

$$p_{11} = f_L(i)^2$$

$$p_{10} = p_{01} = f_L(i)\,(1 - f_L(i))$$

$$p_{00} = (1 - f_L(i))^2$$

so that the internal interaction A is:

$$A(i) = 2L(L-1)\bigl(-4 f_L(i)^2 + 4 f_L(i) - 1\bigr)$$

3. Interaction with the observations:

Assuming that the 1's in the observations and in the field are independently
distributed we get:

$$W(i) = \alpha L^2 \bigl[f_L(i)(1 - g_L(i)) + (1 - f_L(i))\, g_L(i)\bigr] = \alpha L^2 \bigl[f_L(i) + g_L(i) - 2 f_L(i)\, g_L(i)\bigr]$$

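Finally, the energy takes the form:

$$\begin{aligned}
U_L(f_L) &= \frac{1}{T_0} \sum_{i,j} U_{ij} + \sum_i \Bigl(\frac{1}{T_0} A(i) + W(i)\Bigr) \\
&= \frac{L}{T_0} \sum_{i,j} \bigl[2(f_L(i) + f_L(j)) - 4 f_L(i) f_L(j) - 1\bigr] \\
&\quad + \frac{2L(L-1)}{T_0} \sum_i \bigl(-4 f_L(i)^2 + 4 f_L(i) - 1\bigr) \\
&\quad + \alpha L^2 \sum_i \bigl[f_L(i) + g_L(i) - 2 f_L(i)\, g_L(i)\bigr]
\end{aligned}$$

Note that the sums are taken over pairs of adjacent blocks, and over all the blocks,
respectively. For L = 1, this expression reduces to (8') with

$$V_c(f_i, f_j) = 2(f_i + f_j) - 4 f_i f_j - 1$$

$$q_c(a, b) = a + b - 2ab$$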
For L > 1, the quadratic terms of $U_L$ are:

$$\frac{L}{T_0} \Bigl[-4 \sum_{i,j} f_L(i) f_L(j) - 8(L-1) \sum_i f_L(i)^2\Bigr]$$

and since

$$-2 \sum_{i,j} f_L(i) f_L(j) + 2 \sum_i f_L(i)^2 = \sum_{i,j} \bigl(f_L(i) - f_L(j)\bigr)^2 \geq 0$$

it follows that

$$\sum_i f_L(i)^2 \geq \sum_{i,j} f_L(i) f_L(j)$$

and

$$-4 \sum_{i,j} f_L(i) f_L(j) - 8(L-1) \sum_i f_L(i)^2 \leq -(4 + 8(L-1)) \sum_i f_L(i)^2 \leq 0$$

which implies that $U_L$ is negative definite for L > 1, and therefore, its minima,
constrained to the hypercube $[0, 1]^{N_L}$ ($N_L$ is the total number of blocks) will always
lie in a corner of such hypercube, which means that we can use simulated annealing
to find the global minimum of $U_L$, constraining the search to $\{0, 1\}^{N_L}$. In this case,
the energy to be minimized takes the simpler equivalent form (up to an additive
constant):

$$U_L = \frac{L}{T_0} \sum_{i,j} V(f_L(i), f_L(j)) + \alpha L^2 \sum_i q_c(f_L(i), g_L(i))$$

The  minimum  energy  solutions  for  each  L  can  be  interpreted  as  "coarse  scale" 
representations  of  the  original  pattern  /.  Once  a  solution  is  obtained,  the  next 
refinement  (for  blocks  of  size  L/2)  can  be  efficiently  obtained  using  the  previous 
solution  as  a  starting  point,  and  initiating  the  annealing  process  at  a  lower  temperature 
(the  MAP  estimates  presented  in  this  chapter  were  obtained  using  this  technique). 
At  present,  however,  we  do  not  have  a  good  method  (other  than  trial  and  error)  for 
determining  the  optimal  values  for  these  initial  temperatures. 
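The block quantities used in this derivation can be written down compactly. The sketch below is an illustration under the maximum entropy assumption stated above; `block_density`, `block_energy` and `ising_energy` are hypothetical helper names, images are plain lists of lists, and the lattice side is assumed to be a multiple of the block size. For blocks of size 1 and binary fields, the block energy should coincide with the original posterior energy (8).

```python
def block_density(image, size):
    """Density of 1's on each size x size block of a binary image."""
    rows, cols = len(image) // size, len(image[0]) // size
    return [[sum(image[r * size + a][c * size + b]
                 for a in range(size) for b in range(size)) / size ** 2
             for c in range(cols)] for r in range(rows)]


def block_energy(f_blocks, g_blocks, size, t0, alpha):
    """Block energy: adjacent-block, within-block and observation terms."""
    rows, cols = len(f_blocks), len(f_blocks[0])
    pair = internal = data = 0.0
    for r in range(rows):
        for c in range(cols):
            a = f_blocks[r][c]
            for rr, cc in ((r + 1, c), (r, c + 1)):     # each adjacent pair once
                if rr < rows and cc < cols:
                    b = f_blocks[rr][cc]
                    pair += 2 * (a + b) - 4 * a * b - 1
            internal += -4 * a * a + 4 * a - 1
            data += a + g_blocks[r][c] - 2 * a * g_blocks[r][c]
    return ((size * pair + 2 * size * (size - 1) * internal) / t0
            + alpha * size ** 2 * data)


def ising_energy(f, g, t0, alpha):
    """Original posterior energy (8) for binary f and g."""
    rows, cols = len(f), len(f[0])
    prior = sum((-1 if f[r][c] == f[rr][cc] else 1)
                for r in range(rows) for c in range(cols)
                for rr, cc in ((r + 1, c), (r, c + 1))
                if rr < rows and cc < cols)
    mismatches = sum(f[r][c] != g[r][c] for r in range(rows) for c in range(cols))
    return prior / t0 + alpha * mismatches


f = [[0, 1], [1, 1]]
g = [[1, 1], [0, 1]]
print(block_energy(f, g, 1, 1.74, 0.8), ising_energy(f, g, 1.74, 0.8))
print(block_density([[1, 1, 0, 0], [1, 0, 0, 0]], 2))
```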

Also  in  this  connection,  the  work  of  Blake  (1983,  1985)  should  be  mentioned. 
This  author  proposed  the  minimization  of  an  energy  function  similar  to  Up  as  a 
pragmatic  criterion  for  restoring  piecewise  constant  images.  He  also  proposed  an 
algorithm,  based  on  the  successive  approximation  of  Up  by  a  family  of  convex 
envelopes  to  find  an  approximation  to  the  global  minimizer. 

The  relative  performance  and  computational  efficiency  of  these  various  schemes 
should  be  assessed  experimentally. 


5.2.  MPM  Estimator. 

In  the  case  of  the  MPM  estimate,  it  is  possible  to  construct  a  fast  deterministic 
algorithm  whose  experimental  performance  (in  terms  of  the  average  segmentation 
error)  is  equivalent  to  the  Monte  Carlo  method  discussed  above.  It  is  based  on  the 
following  ideas: 

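First, we recall that for a binary pattern, the MPM and TPM estimates coincide.
We will approximate the posterior mean of (3) by that of a Gaussian distribution
$P_G$ with the property:

$$P_G(h) \propto P_{f|g}(h; g) \quad \text{for all } h \in \{0, 1\}^N.$$

In particular, we use:

$$P_G(h) = \frac{1}{Z_G} \exp\Bigl\{-\frac{1}{T_0} \sum_i \sum_{j \in N_i} (h_i - h_j)^2 - \alpha \sum_i (h_i - g_i)^2\Bigr\}$$

where

$$N_i = \{j \in L : \|i - j\| = 1\}.$$

For this distribution, the mean $\bar{h}$ is the (unique) minimizer of the convex function:

$$W(h) = \frac{1}{T_0} \sum_i \sum_{j \in N_i} (h_i - h_j)^2 + \alpha \sum_i (h_i - g_i)^2$$

which corresponds to the unique fixed point of the system:

$$h_i^{(k+1)} = \frac{\sum_{j \in N_i} h_j^{(k)} + \alpha T_0\, g_i}{|N_i| + \alpha T_0} \qquad (9)$$

We could now approximate our estimate by putting:

$$\hat{f}_i = \Theta(\bar{h}_i)$$

$$\Theta(x) = \begin{cases} 1, & \text{if } x > 0.5 \\ 0, & \text{otherwise} \end{cases}$$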

There is an additional consistency condition that $f^*$ must satisfy, however. It can
be shown that when the posterior distribution has the form given by (3) and (4),
the MPM estimate $\hat{f}$, which by definition satisfies:

$$P_{i|g}(\hat{f}_i; g) > P_{i|g}(1 - \hat{f}_i; g)$$

also satisfies:

$$P_{i|g}(\hat{f}_i; \hat{f}) > P_{i|g}(1 - \hat{f}_i; \hat{f}) \qquad (11)$$


which  means  that  if  we  replace  the  observations  by  the  MPM  estimate,  and  compute 
a  new  MPM  estimate  for  this  modified  problem,  we  should  get  the  same  result  (the 
proof  is  included  in  appendix  4.C).  Translating  this  condition  to  the  case  of  f  ,  we 
get  that  it  must  satisfy: 

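$$f^*_i = \Theta(h^*_i) \qquad (12)$$

where $h^*$ satisfies:

$$h^*_i = \frac{\sum_{j \in N_i} h^*_j + \alpha T_0 \, \Theta(h^*_i)}{|N_i| + \alpha T_0}$$

In practice, we get $h^*$ as the fixed point of the system:

$$h_i^{*(t+1)} = \frac{\sum_{j \in N_i} h_j^{*(t)} + \alpha T_0 \, \Theta(h_i^{*(t)})}{|N_i| + \alpha T_0} \qquad (13)$$

with

$$h^{*(0)} = \bar{h}$$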


Note  that  the  function: 


$$U_h(h) = \sum_i \sum_{j \in N_i} (h_i - h_j)^2 + \alpha T_0 \sum_i (h_i - \Theta(h_i))^2$$


acts  as  a  Lyapunov  function  for  the  system  (13),  which  is  therefore  (locally)  stable 
(Vidyasagar,  1978). 

This algorithm can be visualized as operating in two steps: In the first one,
we extract all the information that we need from the observations and encode it in
$\bar{h}$ (which is continuous-valued), and in the second one, we find the closest binary
pattern that satisfies the consistency condition (11).


To  illustrate  the  performance  of  this  approximation,  we  show  f  ,  for  the 
example  discussed  above,  in  panel  (e)  of  figure  1,  and  its  corresponding  energy  and 
segmentation  error  in  the  last  column  of  table  1  (labeled  "MPM  det."). 
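
The two-step scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names, the synchronous (all sites at once) update, the 4-connected neighborhood with fewer neighbors on the border, and the fixed iteration counts are all choices made here; `gamma` stands for the product $\alpha T_0$.

```python
import numpy as np

def neighbor_sum_and_count(h):
    """Sum over the 4-connected neighbors of every site, and |N_i| (smaller on the border)."""
    s = np.zeros_like(h, dtype=float)
    n = np.zeros_like(h, dtype=float)
    s[1:, :] += h[:-1, :]; n[1:, :] += 1    # neighbor above
    s[:-1, :] += h[1:, :]; n[:-1, :] += 1   # neighbor below
    s[:, 1:] += h[:, :-1]; n[:, 1:] += 1    # neighbor to the left
    s[:, :-1] += h[:, 1:]; n[:, :-1] += 1   # neighbor to the right
    return s, n

def theta(x):
    """Threshold function: 1 where x > 0.5, 0 elsewhere."""
    return (x > 0.5).astype(float)

def mpm_deterministic(g, gamma, n_smooth=50, n_consistency=100):
    """Approximate the MPM estimate of a binary image g (hypothetical helper).

    Stage 1 iterates the linear fixed-point system for h-bar;
    stage 2 iterates system (13), started from h-bar.
    """
    g = np.asarray(g, dtype=float)
    h = g.copy()
    for _ in range(n_smooth):
        s, n = neighbor_sum_and_count(h)
        h = (s + gamma * g) / (n + gamma)
    for _ in range(n_consistency):
        s, n = neighbor_sum_and_count(h)
        h = (s + gamma * theta(h)) / (n + gamma)
    return theta(h)
```

For example, with a uniform 5 x 5 image of ones in which the center pixel has been flipped, a small `gamma` restores the flipped pixel, while a very large `gamma` simply reproduces the observations.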

5.2.1.  Parallel  Implementation. 


The  dynamical  systems  defined  by  equations  (9)  and  (13)  can  be  implemented 
directly  in  a  parallel  architecture,  such  as  the  "Connection  Machine",  by  assigning 
a  processor  to  each  site,  and  updating  the  state  of  all  sites  at  the  same  time. 
Each  update  will  require,  for  both  systems,  at  most  10  (16-bit)  additions  and  two 
multiplications,  that  is,  a  total  of  672  cycles  of  a  1-bit  processor.  We  have  found 
experimentally  that  in  most  cases,  less  than  50  iterations  of  (9),  and  100  of  (13)  are 
needed  for  convergence,  so  that,  using  the  figures  of  chapter  3,  we  estimate  the 
total  execution  time  as  approximately  0.1  seconds,  an  improvement  of  one  order  of 
magnitude  over  the  general  Monte  Carlo  procedure  described  in  that  chapter. 


5.3.  Analog  Networks. 


Hopfield  and  Tank  (1985)  (see  also  Hopfield,  1982  and  1984)  have  studied  the 
behavior  of  "neural"  analog  networks  of  non-linear  amplifiers  interconnected  by 
resistors,  whose  dynamics  can  be  described  by  the  differential  equations: 


$$\frac{du_i}{dt} = \sum_{j \in N_i} T_{ij} f_j + I_i - \frac{u_i}{\tau} \qquad (14)$$

$$f_i = \Theta(u_i)$$


Here, $N_i$ is the neighborhood of node $i$; $u_i$ and $f_i$ denote the input and output
voltage of the $i$th amplifier; $T_{ij}$ is the conductance of the link between the nodes
$i$ and $j$; $I_i$ is a fixed current injected at node $i$, and $\tau$ a constant depending on
the internal resistance and capacitance of each amplifier. The gain function of the
amplifiers, $\Theta(\cdot)$, is chosen as a sigmoid function that restricts the output to the
interval (0, 1), and has a form similar to the observed response of biological neurons
(hence the term "neural"). In particular, one can put:

$$\Theta(u) = \frac{1}{1 + \exp(-\beta u)} \qquad (15)$$

where $\beta$ is called the "gain parameter".

These researchers have proved that the system (14) is always stable, provided
we have $T_{ij} = T_{ji}$ for all $i, j$, and that in the high gain limit (for $\beta \gg 1$), the stable
fixed points will be local minima of the "energy" function:

$$E(f) = -\frac{1}{2} \sum_{i,j} T_{ij} f_i f_j - \sum_i I_i f_i \qquad (16)$$


Note that we can write (14) as:

$$\frac{du_i}{dt} = -\frac{\partial E}{\partial f_i} - \frac{u_i}{\tau} \qquad (18)$$

$$f_i = \Theta(u_i)$$


They have also pointed out that if one uses the gain function (15), the fixed
points of (18) will satisfy:

$$f_i = \frac{1}{1 + \exp[-\beta \tau H_i(f)]} \qquad (19)$$

with

$$H_i(f) = -\frac{\partial E}{\partial f_i} = \sum_{j \in N_i} T_{ij} f_j + I_i \qquad (20)$$

These equations will also be satisfied by the mean field approximation (see Reif,
1965) to the ensemble averages of a binary process $f$ ($f_i \in \{0, 1\}$) with respect to
the Gibbs measure generated by the energy (16) at a temperature $T = 1/\beta\tau$. This
can be shown as follows:

The mean field approximation is obtained by assuming that the local energy at
node $i$, which is:

$$E_i(f) = -f_i \Big[ \sum_{j \in N_i} T_{ij} f_j + I_i \Big] = -f_i H_i(f)$$

can be approximated by:

$$E_i \approx -f_i \Big[ \sum_{j \in N_i} T_{ij} \bar{f}_j + I_i \Big] = -f_i H_i(\bar{f})$$

where $\bar{f}_j$ denotes the ensemble average of $f_j$. Since $\{\bar{f}_j\}$ are constants for a given
temperature, we can compute $\bar{f}_i$ as:

$$\bar{f}_i = \frac{\sum_{f_i = 0, 1} f_i \exp[f_i H_i(\bar{f}) / T]}{\sum_{f_i = 0, 1} \exp[f_i H_i(\bar{f}) / T]} = \frac{1}{1 + \exp[-H_i(\bar{f}) / T]}$$

This  means  that  there  is  a  fixed  point  of  equation  (18)  that  can  be  interpreted  as  an 
approximation  of  the  ensemble  average  of  a  corresponding  binary  MRF  (note  that 
in  general  this  fixed  point  will  not  be  unique,  and  will  depend  on  the  selection  of 
the  initial  conditions;  the  lack  of  an  adequate  criterion  for  making  this  selection  in 
the  general  case  represents,  at  this  point,  a  serious  limitation  of  this  approach). 
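
As an illustration of the relation between the network dynamics and the mean field equations, the following sketch integrates the differential equations of the network with a simple forward Euler scheme and returns the state it settles into. It is only a sketch under assumptions made here: the function names, the dense matrix `T` (with zeros encoding the absence of a link), the step size `dt`, and the number of steps are illustrative choices, and no claim is made that this reproduces the digital simulations reported in this chapter.

```python
import numpy as np

def sigmoid(u, beta):
    """Gain function of the amplifiers, as in (15)."""
    return 1.0 / (1.0 + np.exp(-beta * u))

def simulate_network(T, I, beta=5.0, tau=1.0, dt=0.01, steps=5000, u0=None):
    """Forward-Euler integration of du_i/dt = sum_j T_ij f_j + I_i - u_i / tau.

    T must be symmetric; u0 defaults to zero, i.e. all outputs start at 0.5.
    Returns the final input voltages u and outputs f = sigmoid(u).
    """
    u = np.zeros(len(I)) if u0 is None else np.array(u0, dtype=float)
    for _ in range(steps):
        f = sigmoid(u, beta)
        u += dt * (T @ f + I - u / tau)
    return u, sigmoid(u, beta)
```

At a fixed point, `u = tau * (T @ f + I)`, so the outputs satisfy `f = 1 / (1 + exp(-beta * tau * H))` with `H = T @ f + I`, which is the mean field equation at temperature $1/\beta\tau$. As noted above, the state reached depends in general on the initial conditions.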

In  the  case  of  the  posterior  energy  (4),  if  we  require  that  /,  G  {0, 1},  we  can 
write  it  in  the  equivalent  form  (up  to  an  additive  constant): 

upU)  -  -4-  E  E  Mi  -  EI-^P  +  “(2«  - !)!/( 

J°  »  j€Nt  i 


so  that 


dUP 

dfi 


=  E  fj  +  a(29i  ~  !)  - 

io  jeNi 


m\ 

To 


In this form, one can construct directly the system (18), and defining the initial state
as $f_i^{(0)} = 0.5$ for all $i$, find the stable fixed point that will approximate $\bar{f}$.

Since for a binary system the MPM and TPM estimators are equivalent, we can
approximate the optimal estimate by:

$$\hat{f}_i = \begin{cases} 0, & \text{if } \bar{f}_i < 0.5 \\ 1, & \text{otherwise} \end{cases}$$


We have performed digital simulations of the system (18), and have found very
good performance for relatively high signal to noise ratios. For high error rates,
the behavior of this approximation is similar to that of the MAP estimator. We will
have more to say about this approach at the end of the next chapter.

6.  Simultaneous  Estimation  of  the  Field  and  the  Parameters. 

To apply the estimation procedures described in the previous sections, the
parameters that characterize both the prior model of the field (the natural temperature
$T_0$) and the noise process (the error rate $\epsilon$, or the variance $\sigma^2$) have to be known.
In most practical cases, however, we are only given the noisy observations $g$ and
general qualitative information about the structure of the field and the noise, so that
$f$, $\alpha$ (which stands for either $\log[(1 - \epsilon)/\epsilon]$ or $1/2\sigma^2$) and $T_0$ have to be simultaneously
estimated.

In principle, one could use again a Bayesian approach, and assuming prior
independent uniform distributions for $\alpha$ and $T_0$ (in the ranges $[\alpha^0, \alpha^1]$ and $[T_0^0, T_0^1]$,
respectively), find those $\alpha$, $T_0$ and $f$ which jointly maximize the posterior distribution:

$$P(f, \alpha, T_0 \mid g) = \frac{\exp[-U_P(\alpha, T_0, f)]}{(\alpha^1 - \alpha^0)(T_0^1 - T_0^0) \, Z(T_0) \, P_g(g)}$$

The main difficulty here is the extraordinary computational complexity of the
partition function:

$$Z(T_0) = \sum_f \exp\Big[ -\frac{1}{T_0} U_0(f) \Big]$$

which makes this approach impractical, except for very small lattices.

An  alternative  approach  is  based  on  the  following  considerations  (we  will  study 
in  detail  the  case  of  a  BSC;  other  noise  models  can  be  analyzed  in  a  similar  way): 

Equations (9) and (13), which describe the deterministic approximations to
$\hat{f}_{MPM}$, depend on the parameters of the system, $\epsilon$ and $T_0$, only through the product:

$$\gamma = \alpha T_0 = T_0 \log\frac{1 - \epsilon}{\epsilon}$$

which means that the behaviour of the algorithm is completely characterized by the
single parameter $\gamma$. In the case of the Monte Carlo approximation, if we fix the
value of $\gamma$, the value of $T_0$ cannot be chosen arbitrarily, since it has to satisfy the
consistency condition:

$$T_0 = \frac{\gamma}{\alpha} \qquad (21)$$

with

$$\alpha = \log\frac{1 - \epsilon}{\epsilon}, \qquad \epsilon = \frac{1}{N} \sum_{i=1}^{N} z_i \qquad (22)$$

where $z$ is the residual process defined as:

$$z_i = \begin{cases} 1, & \text{if } f_i \neq g_i \\ 0, & \text{otherwise} \end{cases} \qquad (23)$$


This means that, given $\gamma$, the correct value of $T_0$ can, in principle, be determined
in an adaptive way, so that in this case too, the behaviour of the approximation
depends effectively only on $\gamma$.

For a given value of $\gamma$, we can approximate the corresponding MPM estimate
$\hat{f}$ using the methods developed in the previous section, and compute the residual
process $z$ and the conditional (on $\gamma$) Maximum Likelihood Estimate of the error
rate $\hat{\epsilon}$ using equations (22) and (23). The corresponding conditional estimate for $T_0$
will be:

$$\hat{T}_0 = \frac{\gamma}{\hat{\alpha}} \qquad (24)$$
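
The conditional estimates just described amount to a few array operations. The sketch below is illustrative only: the function name and the return convention are assumptions made here, and the guard that rejects an empirical error rate outside the open interval (0, 0.5) is an added safeguard (the logarithm is undefined at 0, and $\hat{\alpha}$ would be zero or negative at or above 0.5).

```python
import math
import numpy as np

def conditional_estimates(f_hat, g, gamma):
    """Residual process and conditional estimates for a given gamma (hypothetical helper).

    Returns (z, eps, alpha, T0); alpha and T0 are None when eps is not in (0, 0.5).
    """
    f_hat = np.asarray(f_hat)
    g = np.asarray(g)
    z = (f_hat != g).astype(int)           # residual process, as in (23)
    eps = float(z.mean())                  # empirical error rate, as in (22)
    if not 0.0 < eps < 0.5:
        return z, eps, None, None
    alpha = math.log((1.0 - eps) / eps)    # alpha for a binary symmetric channel
    T0 = gamma / alpha                     # conditional estimate, as in (24)
    return z, eps, alpha, T0
```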

To  measure  the  "likelihood"  of  the  estimate  /,  we  use  the  degree  of  uniformity 
(or  "whiteness")  of  the  residual  process  z.  This  property  can  be  quantified  by  the 
variance  of  the  local  noise  density,  which  we  estimate  as  follows: 


We cover the lattice with a set $\{S_j\}$ of $m$ non-overlapping squares (say, 8
pixels wide). For each square $S_j$, the relative variance of the noise density is:

.2 


-Pf-7 


(25) 


8S 


where $|S_j|$ is the area of the $j$th square.
The desired likelihood function is:


i{f)  = 


~  £  °i 
i=  i 


which is equivalent to a $\chi^2$ criterion (Cramer, 1946) normalized to take into account
the sample size.

Alternatively, one can use directly the likelihood that the residuals come from
a uniform distribution. To compute it, we note that the quantities:

$$\nu_j = \sum_{i \in S_j} z_i$$

are distributed according to the multinomial law:

$$P(\nu_1, \ldots, \nu_m) = \frac{n!}{\nu_1! \cdots \nu_m!} \Big( \frac{1}{m} \Big)^n, \qquad n = \epsilon N = \nu_1 + \cdots + \nu_m$$

Using the Stirling approximation we get the log-likelihood:

$$L(\nu_1, \ldots, \nu_m) = \log P(\nu_1, \ldots, \nu_m) \approx -\sum_{i=1}^{m} \nu_i \log \nu_i + n \log\Big[ \frac{n}{m} \Big] + \frac{1}{2} \log\Big( \frac{n}{\nu_1 \cdots \nu_m} \Big) + K \qquad (27)$$
where $K$ is a constant. We have found experimentally that both likelihood measures
(26) and (27) have a similar behavior when $n$ is large. When $n$ is relatively small,
or when for some $i$, $\nu_i = 0$, however, (26) is preferable, and so, it is the one we
adopt.

Note that a more conventional likelihood function, such as the conditional
likelihood proposed by Besag (1972), will not work in this case; this function is
defined as:

$$L(f) = L_1(f) + L_2(f)$$

with

$$L_k(f) = \prod_{i \in C_k} P(f_i \mid f_j, j \in N_i; T_0)$$

$$= \prod_{i \in C_k} \frac{\exp[-\frac{1}{T_0} \sum_{j \in N_i} V(f_i, f_j)]}{\exp[-\frac{1}{T_0} \sum_{j \in N_i} V(f_i, f_j)] + \exp[-\frac{1}{T_0} \sum_{j \in N_i} V(1 - f_i, f_j)]}$$

$$= \prod_{i \in C_k} \Big( 1 + \exp\Big[ \frac{2}{T_0} \sum_{j \in N_i} V(f_i, f_j) \Big] \Big)^{-1}$$

where the "codings" $C_1$ and $C_2$ are the sets:

$$C_1 = \{ i : (x_i \text{ is odd and } y_i \text{ is even}) \text{ or } (x_i \text{ is even and } y_i \text{ is odd}) \}$$

$$C_2 = \{ i : (x_i \text{ is odd and } y_i \text{ is odd}) \text{ or } (x_i \text{ is even and } y_i \text{ is even}) \}$$

with $(x_i, y_i)$ denoting the row and column indices of site $i$ (notice that, given the
value of the field at the sites of $C_1$, the random variables associated with any pair
of sites of $C_2$ become independent, and vice versa). In our case, we find that as $\gamma$
decreases, $\hat{f}$ becomes more and more uniform, while $\hat{T}_0$ remains almost constant.
It is not difficult to see that as a result, the conditional likelihood $L$ will decrease
monotonically with $\gamma$, which renders it useless for our purpose.
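
Returning briefly to the multinomial measure of uniformity introduced earlier in this section, the following sketch evaluates the log-probability of a set of block counts $\nu_1, \ldots, \nu_m$ both exactly (through the log-gamma function) and through the Stirling approximation, so that the two can be compared numerically. The function names are assumptions made here, and the constant $K$ is taken as $\frac{1}{2}(1 - m)\log 2\pi$, which is what a direct application of Stirling's formula yields. The Stirling form is undefined when some count is zero, which is one reason the variance-based measure is preferred in that case.

```python
import math

def multinomial_loglik_exact(nu):
    """Exact log-probability of counts nu under a uniform multinomial law."""
    n, m = sum(nu), len(nu)
    return (math.lgamma(n + 1)
            - sum(math.lgamma(v + 1) for v in nu)
            - n * math.log(m))

def multinomial_loglik_stirling(nu):
    """Stirling approximation of the same quantity; requires every count > 0."""
    if min(nu) <= 0:
        raise ValueError("Stirling form is undefined when some count is zero")
    n, m = sum(nu), len(nu)
    K = 0.5 * (1 - m) * math.log(2.0 * math.pi)   # constant from Stirling's formula
    return (-sum(v * math.log(v) for v in nu)
            + n * math.log(n / m)
            + 0.5 * (math.log(n) - sum(math.log(v) for v in nu))
            + K)
```

For counts of a few tens per block the two values agree to within a few hundredths, and evenly spread counts score higher than strongly clustered ones.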

The range of values $[\gamma_0, \gamma_M]$ of the parameter $\gamma$ that corresponds to the class
of systems of interest can be determined as follows:

One can show that for $\gamma > 8$ we will always have $\hat{f}_{MPM,i} = g_i$ for all $i$, so that
we can use $\gamma_M = 8$. The value of $\gamma_0$ can be obtained from an upper bound for $\epsilon$
and a lower bound for $T_0$. For example, assuming that $\epsilon < .45$ and $T_0 > .5 T_c$, we
get $\gamma_0 = .23$. (Note that when the natural temperature $T_0$ of a first order, isotropic
MRF is below 0.5 times $T_c$ (the critical temperature of the lattice; see Kindermann
and Snell, 1980), the patterns become practically uniform (i.e., $f_i$ = constant for all
$i$), while for values of $T_0$ greater than $1.5 T_c$, we get patterns that are practically
indistinguishable from white noise. Therefore, the assumption $T_0 > .5 T_c$ covers
practically all the interesting cases).
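
The value $\gamma_0 = .23$ quoted above can be checked with a short computation. The sketch assumes that the critical temperature is the classical value for the two-dimensional square-lattice Ising model, $T_c = 2 / \ln(1 + \sqrt{2}) \approx 2.269$; this value is an assumption made here, since the convention for $T_c$ is not restated in this section. The function name and its default arguments are likewise illustrative.

```python
import math

def gamma_lower_bound(eps_max=0.45, t0_factor=0.5):
    """Lower end of the gamma range: gamma_0 = alpha_min * T0_min."""
    # Assumption: critical temperature of the 2-D square-lattice Ising model.
    t_c = 2.0 / math.log(1.0 + math.sqrt(2.0))
    alpha_min = math.log((1.0 - eps_max) / eps_max)   # smallest alpha (largest error rate)
    return alpha_min * t0_factor * t_c

print(round(gamma_lower_bound(), 2))
```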

The  complete  estimation  procedure  is  as  follows: 

Maximum  Likelihood  Estimation  Algorithm: 

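1: Sample the interval $[\gamma_0, \gamma_M]$ at the points

$\gamma_0 < \gamma_1, \ldots, \gamma_n < \gamma_M$

2: For each $\gamma \in Q = \{\gamma_1, \ldots, \gamma_n\}$:

2.1: Find $\hat{f}(\gamma)$ using (9) and (13).

2.2: Compute $z$ using (23).

2.3: Compute $\hat{\epsilon}$ using (22). If $\hat{\epsilon} = 0$, set $L(\hat{f}(\gamma)) = -\infty$ and proceed with the
next value of $\gamma$. Otherwise, compute $\hat{\alpha}$ and go to 2.4.

2.4: Compute $\hat{T}_0$ using (24).

2.5: Compute $L(\hat{f}(\gamma))$ using (25) and (26).

3: Compute the optimal estimate $f^*$ using:

$$f^* = \hat{f}(\gamma^*) : \quad L(\hat{f}(\gamma^*)) = \sup_{\gamma \in Q} L(\hat{f}(\gamma)) \qquad (28)$$

The corresponding $\hat{\epsilon}^*$, $\hat{T}_0^*$ will be the optimal estimates for $\epsilon$ and $T_0$, respectively.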
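
The structure of this procedure can be sketched as a simple sweep over candidate values of $\gamma$. The sketch is hedged in several ways: `estimate` is a caller-supplied function standing in for step 2.1; `block_whiteness` is only a stand-in for the likelihood of step 2.5, taken here as minus the mean squared relative deviation of the block-wise residual density from its global value, which is an assumed form rather than a transcription of (25) and (26); and the rejection of $\hat{\epsilon} \geq 0.5$ is an added guard. All names are hypothetical.

```python
import math
import numpy as np

def block_whiteness(z, eps, block=8):
    """Assumed uniformity score: -(mean over blocks of ((local density - eps) / eps)^2).

    The image is assumed to be at least `block` pixels in each dimension.
    """
    rows, cols = z.shape
    devs = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            local = z[r:r + block, c:c + block].mean()
            devs.append(((local - eps) / eps) ** 2)
    return -float(np.mean(devs))

def ml_sweep(g, estimate, gammas, block=8):
    """Return the best (gamma, f_hat, eps, T0, L) over the sampled gammas, or None."""
    g = np.asarray(g)
    best = None
    for gamma in gammas:
        f_hat = estimate(g, gamma)                 # step 2.1
        z = (f_hat != g).astype(float)             # step 2.2
        eps = float(z.mean())                      # step 2.3
        if eps == 0.0 or eps >= 0.5:
            continue                               # treated as likelihood = -infinity
        alpha = math.log((1.0 - eps) / eps)
        T0 = gamma / alpha                         # step 2.4
        L = block_whiteness(z, eps, block)         # step 2.5 (assumed form)
        if best is None or L > best["L"]:
            best = {"gamma": gamma, "f_hat": f_hat, "eps": eps, "T0": T0, "L": L}
    return best
```

Any estimator with the signature `estimate(g, gamma)` can be plugged in, so the same loop serves for the deterministic approximation and for a Monte Carlo procedure.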

Remarks: 

1. This estimation algorithm allows us to reconstruct a binary pattern $f$ from
the noisy observations $g$ without having to adjust any free parameters. The only
prior assumptions correspond to the qualitative structure of the field $f$ (first order,
isotropic MRF) and to the nature of the noise process, but neither the natural
temperature $T_0$ nor the error rate $\epsilon$ have to be known in advance. In practice, this
means that we can apply it to restore any binary image with uniform granularity,
even if it has not been generated by a Markov random process. We have used this
algorithm to reconstruct a variety of binary images with excellent results. In figure 9
we show such a restoration. The observations (b) were generated from the synthetic
image (a) with an actual error rate of .35 (assumed unknown). The MLE for $f$ is
shown in (c). A complete series of estimates with $\gamma$ varying from .5 to 3.5 is
shown in panel (d).

Figure 9. (a) Synthetic image, (b) Noisy observations, (c) Maximum Likelihood Estimate, (d) A
complete series of estimates. The optimal estimate (for $\gamma = 2.9$) is indicated by an arrow.


2. This procedure can be easily extended to handle any one-parameter noise
corruption process (such as zero mean, additive white Gaussian noise). The extension
to the case of N-ary fields, i.e., to the restoration of piecewise constant images,
is also straightforward (using the general algorithm described in chapter 3 instead
of (9) and (13) in step 2.1). As an example, we present in figure 10 the optimal
restoration of a ternary pattern corrupted by additive white Gaussian noise.

3. We have found that the likelihood function (26) is reasonably well behaved as
a function of $\gamma$. This permits us to perform the one-dimensional search for its
supremum  in  an  economical  way,  by  first  determining  its  approximate  location  using 
a  coarse  sampling  pattern,  and  then  refining  its  position  by  a  fine  sampling  of  a 
reduced  interval.  In  practice,  it  is  possible  to  get  very  good  results  using  on  the 
order  of  15  samples. 

4.  The  whole  procedure  is  highly  distributed,  so  that  it  is  possible  to  implement  it 
in  parallel  hardware  in  a  very  efficient  way. 

7. Formation of Perceptual Clusters.

At the heart of a general purpose perceptual system, one must have a mechanism
for deciding which parts of an image should be considered to "belong" together
(Marroquin, 1976). A simple instance of this problem is the grouping of dots in an
image into perceptual clusters. Some heuristic schemes have been proposed to model
this phenomenon (see for example, O’Callahan, 1974). We will show, however, how
this problem can be formulated in an elegant way that is also biologically motivated,
as a particular case of the reconstruction of binary patterns from noisy observations.

The  conceptual  model  for  this  formulation  is  as  follows: 

Let  us  consider  the  dots  that  form  the  original  pattern  as  patches  belonging  to 
some  objects  of  uniform  color  that  are  partially  hidden,  say,  by  some  foliage.  In 
this  way,  the  formation  of  clusters  is  equivalent  to  the  problem  of  reconstructing 
these  objects  (whose  cohesive  nature  is  modeled  by  a  first  order  MRF  with  Ising 
potentials)  from  observations  that  are  formed  according  with  the  following  model: 

Suppose that $f_i = 1$ only if an object overlaps the $i$th site of the lattice. We
assume that the "foliage" will hide this point (i.e., make $g_i = 0$) with probability
$\epsilon$, and that spurious values of $g_i = 1$ can appear in sites where $f_i = 0$ with a very
small probability $p$:

$$g_i = \begin{cases} 1, & \text{with prob. } (1 - \epsilon), \text{ if } f_i = 1 \\ 0, & \text{with prob. } \epsilon, \text{ if } f_i = 1 \\ 0, & \text{with prob. } (1 - p), \text{ if } f_i = 0 \\ 1, & \text{with prob. } p, \text{ if } f_i = 0 \end{cases}$$

with $p \ll 1$. The posterior energy is:

$$U_P(f; g) = \frac{1}{T_0} U_0(f) + \alpha \sum_{i : f_i = 1} (1 - \delta(1 - g_i)) + M \sum_{i : f_i = 0} (1 - \delta(g_i)) \qquad (29)$$

where $U_0(f)$ is given by (1) and (2):

$$V_C(f_i, f_j) = \begin{cases} -1, & \text{if } |i - j| = 1 \text{ and } f_i = f_j \\ 1, & \text{if } |i - j| = 1 \text{ and } f_i \neq f_j \\ 0, & \text{otherwise} \end{cases}$$

$$U_0(f) = \sum_{i, j} V_C(f_i, f_j)$$

$\delta$ and $\alpha$ are defined in (5) and (6):

$$\delta(a) = \begin{cases} 1, & \text{if } a = 0 \\ 0, & \text{otherwise} \end{cases}$$

and $M$ is a very large number.


The clustering task is now equivalent to the problem of estimating $f$ and the
parameters $\alpha$ and $T_0$ from the noisy observations $g$. To accomplish this, we can use
the method described in the previous section, except that in this case, only those
sites for which $\hat{f}_i = 1$ will be useful for the estimation of the residual density and
its local variance. This means that equation (22) has to be modified to:

$$\hat{\epsilon} = \frac{1}{|A|} \sum_{i \in A} z_i$$

where

$$A = \{ i : \hat{f}_i = 1 \}$$

and $z_i$ is defined in (23). Also, the sets $S_j$ used to compute the relative variance of
the residual density in (25) should now be taken as the intersection of the squares
that cover the lattice with the set $A$.
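
The observation model for the clustering problem, and the restriction of the residual density to the sites where the estimate equals 1, can be illustrated as follows. This is a sketch with hypothetical function names; drawing one uniform random number per site is just one convenient way to sample the four cases of the model.

```python
import numpy as np

def observe_through_foliage(f, eps, p, rng=None):
    """Sample observations g from a binary field f.

    Object sites (f = 1) stay visible with probability 1 - eps;
    background sites (f = 0) produce a spurious dot with probability p.
    """
    rng = np.random.default_rng() if rng is None else rng
    f = np.asarray(f)
    u = rng.random(f.shape)                      # uniform numbers in [0, 1)
    g = np.where(f == 1, u >= eps, u < p)
    return g.astype(int)

def residual_rate_on_objects(f_hat, g):
    """Error-rate estimate restricted to A = {i : f_hat_i = 1}; None if A is empty."""
    f_hat = np.asarray(f_hat)
    g = np.asarray(g)
    A = f_hat == 1
    if not A.any():
        return None
    return float((g[A] != f_hat[A]).mean())
```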

With these modifications, the Maximum Likelihood algorithm can be used
for clustering. Its performance is illustrated in figure 11, where we show the
original dot pattern (upper left) and the reconstructed objects for decreasing values
of $\gamma = \alpha T_0$. The maximizer of the likelihood is marked with an arrow. We believe
that these preliminary results are encouraging, although, clearly, more numerical
and psychophysical experiments are needed to assess the plausibility of this scheme
to model human perceptual processes.

8.  Discussion 

In this chapter we have addressed the problem of reconstructing piecewise
constant functions from noisy observations. We showed that the optimal solution
to this problem can be obtained from the observation of the equilibrium behavior
of a generalized Ising net coupled with a spatially varying (but fixed in time)
external field. If we use the minimization of the expected segmentation error as
a criterion, the optimal estimate is the maximizer of the posterior marginals (the
MPM estimator which was described in chapter 3).

Figure 11. Formation of perceptual clusters. We show: the original dot pattern (upper left)
and the reconstructed objects for decreasing values of $\gamma = \alpha T_0$. The maximum likelihood estimate
(i.e., the optimal clustering) is marked with an arrow.

We compared the relative performance of the MAP and MPM estimators, and
found that for moderate signal to noise ratios, they are practically equivalent, but
as the noise level increases, the MPM estimate is (sometimes dramatically) superior.
A consequence of this analysis is that, if the noise level is not too high, the MAP
estimator may be a reasonable choice in those cases where it is computationally
advantageous. This is the case, for example, of the reconstruction of one-dimensional
binary signals, where we derived a very efficient algorithm for its exact computation.

In the two-dimensional case, however, the situation is different: the general
Monte Carlo procedure for the approximation of the MPM estimator is in fact more
efficient, from a computational viewpoint, than the corresponding one for the MAP
(Simulated Annealing), and in the case of binary fields, we derived a much faster
deterministic scheme with excellent experimental performance.

We  also  showed  how  these  estimation  procedures  can  be  extended  to  the  more 
interesting  case  where  the  parameters  of  the  system  are  not  known  in  advance.  In 
this  case,  a  maximum  likelihood  estimation  algorithm  can  be  derived,  which,  using  a 
likelihood  function  that  is  computed  from  the  residuals,  allows  for  the  simultaneous 
estimation  of  the  field  and  the  parameters. 

We  point  out  that  although,  for  the  sake  of  simplicity,  we  have  concentrated 
on  the  case  of  binary  fields  sent  through  binary  symmetric  channels,  the  results  that 
we  have  presented  can  be  generalized  to  N-ary  fields  and  other  noise  models  (see 
figure  10). 

The  constructions  that  we  have  presented  can  be  applied  not  only  to  image 
segmentation  and  restoration,  but  to  other  problems  as  well.  As  an  illustration, 
we  presented  a  novel  application  to  the  modeling  of  the  process  of  formation  of 
perceptual clusters. Another important problem that can be formulated in this way
is the reconstruction of surfaces from stereoscopic pairs of images; we will discuss it
in detail in chapter 6.


Appendix  4.A 


OPTIMALITY  OF  ALGORITHM  A1 


In  this  appendix  we  present  a  proof  of  the  fact  that  the  algorithm  presented  in 
section  4  of  chapter  4,  effectively  computes  the  MAP  estimate  for  a  one-dimensional, 
binary  MRF. 

The  optimality  of  the  algorithm  follows  from  the  following  propositions: 

Proposition 1: Let $S^* = \{l_1, \ldots, l_n\}$ be the optimal boundary configuration, and
suppose that $l_k$, for $k < n$, was detected by A1. Then, $l_{k+1}$ will be the next boundary
detected by A1.

Proof:

Suppose $l_k$ was detected by A1, and let $L$ be the next boundary detected. We will
assume that $L \neq l_{k+1}$ and arrive at a contradiction. We will consider three cases:

Case 1: Suppose A1 detects $L$ at $j < l_{k+1}$.

Then, we must have that

$$U_p(j) > U_p(L) + U_m(j) - U_m(L) + 2$$

and therefore,

$$U(\{l_1, \ldots, l_k, L, j, l_{k+1}, \ldots\}) < U(S^*)$$

which is a contradiction.

Case 2: Suppose A1 detects $L$ at $j \in (l_{k+1}, l_{k+2}]$.

This means that at $j$ we had that $L$ was the optimal location for the boundary. In
particular,

$$U_p(l_{k+1}) + U_m(j) - U_m(l_{k+1}) > U_p(L) + U_m(j) - U_m(L)$$

which implies that

$$U_p(L) + U_m(l_{k+2}) - U_m(L) < U_p(l_{k+1}) + U_m(l_{k+2}) - U_m(l_{k+1})$$

and therefore,

$$U(\{l_1, \ldots, l_k, L, l_{k+2}, \ldots\}) < U(S^*)$$

which is a contradiction.

Case 3: Suppose that A1 has not detected any new boundary at $j = l_{k+2} + 1$.


Then,  we  must  have: 


which  means  that 


Up(h+ 2  +  1)  <  Ub{lk+ 2  +  1)  +  1 


which  is  again  a  contradiction.  | 

Proposition 2: If A1 runs from left to right starting at a point $l_0$, and generates
the boundaries $\{l_1, \ldots, l_n\}$, then $l_j \in S^*$ (the set of boundaries of the optimal
configuration) for $j \geq 2$.

Proof:

Let $f^*$, $f^M$ be the optimal configuration, and the one generated by A1, respectively. Let

$$L_0 = \sup\{ j \in S^* : j < l_1 \}$$

$$L = \inf\{ j \in S^* : j > l_1 \}$$

If $L_0 = l_0$, we apply proposition 1 and finish the proof; so, let us assume that
$L_0 \neq l_0$, and that $l_1$ was detected at $i$. We have two cases:

Case 1: $L_0 > l_0$. We claim that in this case, $l_1 \in S^*$, and therefore, by proposition
1, $l_j \in S^*$ for $j \geq 1$. To prove this claim, we consider two subcases:

Case 1-a: $f^*((l_0, L_0)) \neq f^M((l_0, L_0))$.

In this case, we have:

$$2 + U_m(i) - U_m(l_1) + U_p(l_1) < U_p(i)$$

and therefore,

$$2 + U_m(i) - U_m(l_1) + U_p(l_1) - U_p(L_0) < U_p(i) - U_p(L_0)$$

which implies that $l_1 \in S^*$.

Case 1-b: $f^*((l_0, L_0)) = f^M((l_0, L_0))$.

Suppose $l_1 \notin S^*$. We have that, at location $i$,

$$U_p(l_1) + U_m(i) - U_m(l_1) + 2 < U_p(L_0) + U_m(i) - U_m(L_0) + 2$$

since otherwise, $L_0$ would have been a better location for the boundary. However,
this implies that

$$U_p(l_1) + U_m(L) - U_m(l_1) < U_p(L_0) + U_m(L) - U_m(L_0)$$

which means that we can improve $S^*$ by moving $L_0$ to $l_1$, which is a contradiction.
Case 2: $L_0 < l_0$.

Again, we consider two subcases:

Case 2-a: $f^*((L_0, l_0)) = f^M((L_0, l_0))$.

Let $U_+$, $U_-$ be the energy increments with respect to $L_0$:

$$U_+(i) = \sum_{j = L_0}^{i} \Phi_{+k}(g_j)$$

$$U_-(i) = \sum_{j = L_0}^{i} \Phi_{-k}(g_j)$$

Note that

$$U_p(i) = U_+(i) - U_+(l_0) \quad \text{and} \quad U_m(i) = U_-(i) - U_-(l_0)$$

Since $l_1$ was detected at $i$, we have:

$$2 + U_m(i) - U_m(l_1) + U_p(l_1) < U_p(i)$$

and therefore,

$$2 + U_-(i) - U_-(l_1) + U_+(l_1) < U_+(i)$$

which means that $l_1 \in S^*$.

Case  2-b:  f‘((Lo,to))  ^  /ai((Lo,  lo)). 

Using  the  same  definitions  for  U+,U _,  we  have  that,  by  the  optimality  of  S*,  for 
some  j  >  L, 

U-U )  -  +  2  <  U+(j) 

and  therefore. 


C7_(j)  -  U-(L)  +  U+(L )  -  +  2  <  tf+(L)  -  t/+0i) 

which  means  that  if  Al  detects  llt  it  must  detect  L  too,  unless  it  detected  i2  first, 
but  in  this  case  we  have  that,  for  some  p  <  j , 

U-(p)  -  U-(l2 )  +  U+(h)  ~  U+(t 0  +  2  <  £4(p)  - 

which  implies  that  Z2  G  5*.  This  completes  the  proof.  | 

It should be clear that these results can be easily extended to the case where
A1 runs backwards (from right to left). With this extension, we get the following
complete optimal procedure:

Algorithm  A2: 


1:  Run  Al  from  left  to  right.  Detect 

2:  Run  Al  backwards  (starting  from  l2).  Get  either 

{l2,...,ln}  or  0i 


In  either  case,  this  is  the  optimal  solution. 

The  only  thing  that  remains  to  be  proved  is  that  the  determination  of  the 
optimal  location  for  a  boundary  is  in  fact  performed  by  step  2.1  of  Al.  We  have 
the  following: 


Proposition 3: Suppose that A1 detected a boundary at (or started from) $l_0$. Then,
the optimal location $l$ of the next boundary has to be updated only at places where
si = $-k$ and sim1 = $k$ (note that in si we have stored the value of the maximum
likelihood estimate $f_i^{ML}$, while sim1 = $f_{i-1}^{ML}$). Suppose $i$ is one such place. The
optimal location will be:

$$l = \begin{cases} i - 1, & \text{if } U_p(i - 1) - U_m(i - 1) < U_p(l) - U_m(l) \\ l \text{ (the current value)}, & \text{otherwise} \end{cases}$$

Proof:

First, we note that a necessary and sufficient condition for $l$ to be the optimal
location of the boundary at the point $i$ is that, for $j \in [l_0, i - 1]$:

$$U_p(l) + U_m(i) - U_m(l) \leq U_p(j) + U_m(i) - U_m(j)$$

or equivalently,

$$U_p(l) - U_m(l) \leq U_p(j) - U_m(j)$$

Suppose $l$ was the optimal location at $i - 1$, and we process observation $i$. We
consider several cases:

Case 1: sim1 = $-k$

In this case, we show that $l$ remains the optimal location:

By construction, we have that:

$$U_p(i - 1) = U_p(i - 2) + \Phi_{+k}(g_{i-1})$$

$$U_m(i - 1) = U_m(i - 2) + \Phi_{-k}(g_{i-1})$$

Since sim1 = $-k$ we have that

$$\Phi_{+k}(g_{i-1}) > \Phi_{-k}(g_{i-1})$$

and therefore,

$$U_p(i - 1) - U_m(i - 1) = U_p(i - 2) - U_m(i - 2) + \Phi_{+k}(g_{i-1}) - \Phi_{-k}(g_{i-1}) > U_p(i - 2) - U_m(i - 2) \geq U_p(l) - U_m(l)$$

so that $l$ remains the optimal location.

Case 2: sim1 = $k$

In this case we have that

$$U_p(i - 1) - U_m(i - 1) < U_p(i - 2) - U_m(i - 2)$$

This means that the minimal value for $U_p(i) - U_m(i)$ on a block for which si = $k$
will be obtained at the extremal point where si = $-k$ and sim1 = $k$, and since,
by theorem 1 of appendix 4.B, this is the only point where a boundary might
be placed, it is sufficient to update the optimal location only at these points. So,
suppose sim1 = $k$ and si = $-k$.

If

$$U_p(l) - U_m(l) \leq U_p(i - 1) - U_m(i - 1),$$

then,

$$U_p(l) - U_m(l) \leq U_p(j) - U_m(j) \quad \text{for } j \in [l_0, i - 1]$$

because $l$ was the optimal location outside the last block where si = $k$. By the same
token, it is clear that if

$$U_p(l) - U_m(l) > U_p(i - 1) - U_m(i - 1),$$

the new optimal location will be $i - 1$. ∎


Appendix  4.B 


DYNAMIC  PROGRAMMING  FORMULATION  OF  THE 
ONE-DIMENSIONAL  MAP  ESTIMATION  PROBLEM 


In this appendix we present an algorithm for finding the global minimum of:

$$U_P = \sum_{i=1}^{N-1} V(f_i, f_{i+1}) + \alpha \sum_{i=1}^{N} \Phi_{f_i}(g_i) \qquad (1)$$

which,  based  on  dynamic  programming  principles,  reduces  the  problem  to  a  sequence 
of  one  dimensional  optimizations. 

As  we  will  see,  this  algorithm  generates,  as  a  byproduct,  a  family  of  solutions 
which  can  be  considered  as  descriptions  of  the  field  /  at  different  scales,  so  that  the 
coarse  descriptions,  which  are  computed  very  fast,  are  progressively  refined  until 
the  optimal  (finest  scale)  configuration  is  found. 

This  approach  is  based  on  the  following  idea: 

A configuration $f$ is completely characterized by the value of $f_1$, and the set
$L_n$ defined by:

$$L_n = \{ L_j : f_{L_j} \neq f_{L_j + 1} \} \qquad (2)$$

We will call the $n$ elements of $L_n$ the "boundaries" of the configuration $f$. Since
these boundaries correspond to odd bonds between neighboring cells, we can define
an equivalent energy function as:

$$U(f) = n + \frac{\alpha}{2} E(f) \qquad (3)$$

with

$$E(f) = \sum_i \Phi_{f_i}(g_i) \qquad (4)$$

For a fixed $n$, $U$ depends only on the value of $f_1$, and on the position of the $n$
boundaries, that is, on $n + 1$ variables. To make this dependence more explicit, let
us define the functions

$$G(L) = \sum_{j=1}^{L} \big( \Phi_{k_0}(g_j) - \Phi_{k_1}(g_j) \big)$$


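Let $U_0$ and $U_1$ denote the energy functions corresponding to the configurations with
$f_1 = k_0$ and $f_1 = k_1$, respectively, for a given set of boundaries

$$L_n = \{L_1, \ldots, L_n\}, \quad L_1 < \cdots < L_n.$$

We have that, for $n$ even,

$$U_0(n, L_n) = n + \frac{\alpha}{2}\Big[\sum_{j=1}^{L_1} \Phi_{k_0}(g_j) + \sum_{j=L_1+1}^{L_2} \Phi_{k_1}(g_j) + \cdots + \sum_{j=L_n+1}^{N} \Phi_{k_0}(g_j)\Big]$$

$$= n + \frac{\alpha}{2}\Big[G(L_1) - G(L_2) + \cdots - G(L_n) + \sum_{j=1}^{N} \Phi_{k_0}(g_j)\Big]$$

$$U_1(n, L_n) = n + \frac{\alpha}{2}\Big[\sum_{j=1}^{L_1} \Phi_{k_1}(g_j) + \sum_{j=L_1+1}^{L_2} \Phi_{k_0}(g_j) + \cdots + \sum_{j=L_n+1}^{N} \Phi_{k_1}(g_j)\Big]$$

$$= n + \frac{\alpha}{2}\Big[-G(L_1) + \cdots + G(L_n) - G(N) + \sum_{j=1}^{N} \Phi_{k_0}(g_j)\Big]$$

and for $n$ odd,

$$U_0(n, L_n) = n + \frac{\alpha}{2}\Big[G(L_1) - G(L_2) + \cdots + G(L_n) - G(N) + \sum_{j=1}^{N} \Phi_{k_0}(g_j)\Big]$$

$$U_1(n, L_n) = n + \frac{\alpha}{2}\Big[-G(L_1) + \cdots - G(L_n) + \sum_{j=1}^{N} \Phi_{k_0}(g_j)\Big]$$

(Note that $\sum_j \Phi_{k_0}(g_j)$ does not depend on $f$).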
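As a numerical check of the identities above, the following Python sketch compares the data term computed site by site with the same term written as an alternating sum of $G$ at the boundaries. It is only an illustration: the function names are hypothetical, labels 0 and 1 stand for $k_0$ and $k_1$, and `weight` stands for whatever factor multiplies the data term in the energy.

```python
from itertools import accumulate

def prefix_G(phi0, phi1):
    """Return [G(0), ..., G(N)] with G(L) = sum_{j<=L} (phi0_j - phi1_j)."""
    return [0.0] + list(accumulate(a - b for a, b in zip(phi0, phi1)))

def data_term_direct(phi0, phi1, boundaries, first_label):
    """Sum of Phi_{f_j}(g_j) when f_1 = first_label (0 for k0, 1 for k1)
    and the label flips right after every site listed in `boundaries`."""
    flips, label, total = set(boundaries), first_label, 0.0
    for j in range(1, len(phi0) + 1):
        total += phi1[j - 1] if label else phi0[j - 1]
        if j in flips:
            label = 1 - label
    return total

def data_term_via_G(phi0, phi1, boundaries, first_label):
    """Same quantity, written as an alternating sum of G at the boundaries."""
    G, N, n = prefix_G(phi0, phi1), len(phi0), len(boundaries)
    alt = sum((-1) ** i * G[L] for i, L in enumerate(sorted(boundaries)))
    base = sum(phi0)
    if first_label == 0:
        # n even: alt + base ; n odd: alt - G(N) + base
        return alt + base - (G[N] if n % 2 else 0.0)
    # n even: -alt - G(N) + base ; n odd: -alt + base
    return -alt + base - (0.0 if n % 2 else G[N])

def energy(phi0, phi1, boundaries, first_label, weight):
    """n + weight * (data term); `weight` is an assumed name for the multiplier."""
    return len(boundaries) + weight * data_term_via_G(
        phi0, phi1, boundaries, first_label)
```

Because the alternating-sum form only needs the prefix sums $G(0), \ldots, G(N)$, the energy of any candidate set of boundaries can be evaluated in time proportional to $n$ rather than $N$.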

Let $S_n^{(0)}$, $S_n^{(1)}$ be the sets of boundaries that minimize $U_0$ and $U_1$, respectively.
Then, the optimal energy for a given $n$ is:

$$U_n^* = \min\left[U_0(n, S_n^{(0)}),\; U_1(n, S_n^{(1)})\right] \qquad (9)$$

We will define $S_n$ to be the corresponding optimal set of boundaries.

The determination of $S_n^{(k)}$ is an $n$-dimensional optimization problem. However,
as we will show below, it is possible to decompose it into a sequence of one
dimensional optimizations using a dynamic programming formulation. With this
approach we also get, as a bonus, the solutions $S_1^{(k)}, \ldots, S_{n-1}^{(k)}$, $k \in \{0, 1\}$, and
the corresponding optimal energies. If we set $n = N$, the solution to the original
problem (3) can then be found by a one dimensional search. This
strategy, however, can be dramatically improved by the use of the following facts:

(i) We can reduce substantially the search space for the location of the optimal
boundaries $L_j \in S_{n^*}$.

(ii) The sequences $\{U_1^*, U_3^*, \ldots\}$ and $\{U_2^*, U_4^*, \ldots\}$ are unimodal. This, together
with the fact that the dynamic programming algorithm uses $S_{j-1}$ to compute
$S_j$, provides us with an efficient stopping criterion for the computation of
the sequence $\{S_1, \ldots, S_{n^*}\}$.

(iii) The expected value of $n^*$ is usually small.

We will now describe the algorithm, and analyze each one of these facts.

1.  Search  Space  for  the  Optimal  Boundaries. 

Let

$$P_M = \{M_1, M_2, \ldots\} = \{j : G(j-1) < G(j) > G(j+1), \text{ with } G(j-1) \neq G(j+1)\} \qquad (10)$$

$$P_m = \{m_1, m_2, \ldots\} = \{j : G(j-1) > G(j) < G(j+1), \text{ with } G(j-1) \neq G(j+1)\} \qquad (11)$$

(Conventionally we include $j = 1$ in $P_M$ if $0 < G(1) > G(2)$, and include it in $P_m$
if $0 > G(1) < G(2)$). We define the set $P$ as

$$P = P_M \cup P_m = \{P_1, \ldots, P_r\}$$

(Note that $P$ corresponds to the set of places where the sequence $\{\Phi_{k_0}(g_j) - \Phi_{k_1}(g_j)\}$
changes sign).

In what follows, we will call the elements of $P_M$, $P_m$ and $P$ the "maxima",
"minima", and "critical points" of $G$, respectively.

Let $S_{n^*+}$ ($S_{n^*-}$) denote the subsets of $S_{n^*}$ formed by those boundaries $L_j$
whose corresponding term $G(L_j)$ has positive (negative) coefficient in $U^*$, i.e., if

$$S_{n^*} = S_{n^*}^{(k)} = \{L_1, \ldots, L_{n^*}\},$$

then

$$S_{n^*+} = \{L_{1+k}, L_{3+k}, \ldots\}, \qquad S_{n^*-} = S_{n^*} - S_{n^*+}. \qquad (12)$$

With these definitions, we have:

Theorem 1: Suppose that $\Phi_{k_0}(g_j) - \Phi_{k_1}(g_j) \neq 0$ for all $j$ (a situation that will occur
with probability 1 for most observation models). Then, $S_{n^*+} \subset P_m$ and $S_{n^*-} \subset P_M$.

To see why this is true, let $f^{ML}$ denote the maximum likelihood estimate for
$f$, obtained by:

$$f_i^{ML} = \begin{cases} k_0, & \text{if } \Phi_{k_1}(g_i) > \Phi_{k_0}(g_i) \\ k_1, & \text{otherwise} \end{cases}$$

and let $f^*$ be the optimal estimate. Suppose that for some $j$ we have, say,
$L_j \in S_{n^*+} - P_m$. Suppose $L_j \in (P_k, P_{k+1})$, for some $P_k, P_{k+1} \in P$. Clearly, either
$P_k \in P_m$ or $P_{k+1} \in P_m$. Suppose, for definiteness, that $P_k \in P_m$.

If $P_k \notin S_{n^*}$, the configuration $\{L_1, \ldots, L_{j-1}, P_k, L_{j+1}, \ldots, L_{n^*}\}$ has lower energy
than $S_{n^*}$ (we decrease $\tilde{U}$ without altering $n$), which is a contradiction. If $P_k \in S_{n^*}$,
then either

$$f^*((P_k, L_j]) \neq f^{ML}((P_k, L_j])$$

or

$$f^*((L_j, P_{k+1}]) \neq f^{ML}((L_j, P_{k+1}])$$

and so, we get a lower energy configuration by deleting $L_j$ and either $P_k$ or $P_{k+1}$ (we
decrease simultaneously $n$ and $\tilde{U}$). A similar argument can be used if $L_j \in [1, P_1)$
or $L_j \in (P_r, N]$. ∎

This result means that we can use $P$ to constrain the search space for the
boundaries of each subproblem (i.e., for each fixed $n$), which now becomes:

For $n \leq |P|$ fixed, find $S_n = \{L_1, \ldots, L_n\}$ with

$$S_{n+} \subset P_m \quad \text{and} \quad S_{n-} \subset P_M \qquad (13)$$

such that $U(n, S_n) \leq U(n, L_n)$ for all $L_n \subset P$.

Note that theorem 1 guarantees that the constrained and unconstrained solutions
will coincide only for $n = n^*$, so that for $n \neq n^*$, $S_n$ may, in general, be suboptimal.
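The critical points of $G$ can be located in a single pass over the data. The sketch below is an illustration under the assumption made in Theorem 1 (no zero differences, so strict inequalities suffice); the function names are hypothetical, and `G` is stored as a list with `G[0] = 0`, which also implements the convention for $j = 1$.

```python
def critical_points(G):
    """G = [G(0), G(1), ..., G(N)] with G(0) = 0.
    Returns (maxima, minima): the sorted indices j in 1..N-1 that are strict
    local maxima / minima of G."""
    maxima, minima = [], []
    for j in range(1, len(G) - 1):
        if G[j - 1] < G[j] > G[j + 1]:
            maxima.append(j)
        elif G[j - 1] > G[j] < G[j + 1]:
            minima.append(j)
    return maxima, minima

def sign_changes(d):
    """1-based positions j where d_j and d_{j+1} have opposite signs, for the
    difference sequence d_j = Phi_k0(g_j) - Phi_k1(g_j)."""
    return [j for j in range(1, len(d)) if d[j - 1] * d[j] < 0]
```

With the differences `d = [1, 1, -1, -1, 1, -1]` the prefix sums are `G = [0, 1, 2, 1, 0, 1, 0]`, and the critical points returned by `critical_points` coincide with the positions reported by `sign_changes`, as noted after the definition of $P$.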

2.  Dynamic  Programming  (DP)  Algorithm. 

From equations (7) and (8), it is clear that, for any fixed $n$, the determination
of the optimal (constrained) configurations $S_n^{(0)}$, $S_n^{(1)}$ is equivalent to the solution of
the optimization problems:

For $S_n^{(0)}$:

Minimize $[G(L_1) - G(L_2) + \ldots]$
with $L_1, L_3, \ldots \in P_m$, and $L_2, L_4, \ldots \in P_M$.

For $S_n^{(1)}$:

Maximize $[G(L_1) - G(L_2) + \ldots]$
with $L_1, L_3, \ldots \in P_M$, and $L_2, L_4, \ldots \in P_m$.

Let us consider the maximization problems. Assume, for definiteness, that the
first critical point of $G$ is a maximum, i.e., $M_1 < m_1$, and define the sequences:

$$D_1(k) = \sup_{i \geq k} G(M_i)$$

$$L_1(k) = \min\{L : G(M_L) = D_1(k)\}, \quad k = 1, \ldots, |P_M| \qquad (14)$$

Clearly, $M_{L_1(1)}$ is the optimal location of the boundary for $n = 1$ (i.e.,
$S_1^{(1)} = \{M_{L_1(1)}\}$), and from $D_1(1)$ we can easily compute the corresponding energy.

We now define, for $j \geq 1$:

$$D_{2j}(k) = \sup_{i \geq k}\{D_{2j-1}(i + 1) - G(m_i)\}$$

$$D_{2j+1}(k) = \sup_{i \geq k}\{D_{2j}(i) + G(M_i)\}$$

and

$$L_{2j}(k) = \min\{L : D_{2j}(k) = D_{2j-1}(L + 1) - G(m_L)\}$$

$$L_{2j+1}(k) = \min\{L : D_{2j+1}(k) = D_{2j}(L) + G(M_L)\}$$

for $k = 1, \ldots, |P_M| - j$. One can check that, for $n$ odd,

$$S_n^{(1)} = \{M_{L_n(1)},\, m_{L_{n-1}(L_n(1))},\, \ldots,\, M_{L_1(L_2(\ldots(L_n(1))\ldots))}\}$$

and the optimal energy is:

$$U_1(n) = n + \frac{\alpha}{2}\Big[-D_n(1) + \sum_j \Phi_{k_0}(g_j)\Big]$$

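For $n$ even, we define:

$$D'_1(k) = \sup_{i \geq k}\{-G(m_i)\}, \quad k = 1, \ldots, |P_m|$$

$$L'_1(k) = \min\{L : D'_1(k) = -G(m_L)\}$$

$$D'_{2j}(k) = \sup_{i \geq k}\{D'_{2j-1}(i) + G(M_i)\}, \quad k = 1, \ldots, |P_m| - j + 1$$

$$L'_{2j}(k) = \min\{L : D'_{2j}(k) = D'_{2j-1}(L) + G(M_L)\}$$

$$D'_{2j+1}(k) = \sup_{i \geq k}\{D'_{2j}(i + 1) - G(m_i)\}, \quad k = 1, \ldots, |P_m| - j$$

$$L'_{2j+1}(k) = \min\{L : D'_{2j+1}(k) = D'_{2j}(L + 1) - G(m_L)\}$$

and get:

$$S_n^{(1)} = \{M_{L'_n(1)},\, \ldots,\, m_{L'_1(L'_2(\ldots L'_n(1)\ldots))}\}$$

$$U_1(n) = n + \frac{\alpha}{2}\Big[-D'_n(1) - G(N) + \sum_j \Phi_{k_0}(g_j)\Big]$$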


For the minimization problems, that is, for the computation of $S_n^{(0)}$, assuming again
that $M_1 < m_1$, we have, for $n$ even:


di{k)  =  inf  {-G(m,)> 

l>* 

ii(fc)  =  {mini  :  d\{k)  =  —  G{mt)} 

and  for  j  >  1, 

d2j(k)=\n(k{d2j_l(i)+G(Mi)} 
hj{k)  =  {min  /  :  d2j{k)  =  d2j^l)  +  G(M,)} 
d2j+iW  —  ,»nf  {<M*  +  1)  -  G(m,)} 
i2>+i(A:)  =  {min  l  :  d2j+i[k)  —  d2j{l  +  1)  -  G(m,)}  (21) 

with  k  varying  in  the  appropriate  range.  The  solutions  are: 

S(„01  =  ..  ..,,.11))...)} 

U„(n)  =  n  +  ?  (d„(l)  +  £  <Msy)]  (22) 

For  n  odd: 

AW  =  inf  {G(M.)} 

»>* 

{<4/-i(<  +  *)  -  c(m.)} 

4+l(‘)  =  .inf  (4(«)  +  G(M,)}  (23) 

with  the  corresponding  definitions  for  Z’-(fc).  The  solutions  are: 

5n0)  = 

t/o{n)  =  n  +  ^(4(1)  -  G(ZV)  +  £  <t>*0(<7y)]  (24) 

The  case  for  which  mi  <  Mi  is  treated  in  a  similar  way. 

The recursions (15), (18), (21) and (23), together with equations (9) and (10),
allow us to compute the sequences $\{S_1, S_2, \ldots\}$ and $\{U_1^*, U_2^*, \ldots\}$ using only one
dimensional optimizations. We now turn to the problem of determining the optimal
value $n^*$ for the number of boundaries.
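The structure of the maximization problem can be illustrated with a short Python sketch. It is not a transcription of the recursions above: it uses one table `tail[t][p]` (a hypothetical name) over the merged, sorted list of critical points, filled from right to left in the same spirit, and it assumes that `maxima` and `minima` are the index sets $P_M$ and $P_m$. The minimization problem can be handled by the same routine applied to $-G$ with the roles of the two sets exchanged.

```python
def best_alternating(G, maxima, minima, n):
    """Maximize G(L1) - G(L2) + G(L3) - ... over L1 < ... < Ln, with the
    odd-numbered boundaries taken from `maxima` and the even-numbered ones
    from `minima`. Returns (value, boundaries), or None if infeasible."""
    pts = sorted([(j, 1) for j in maxima] + [(j, -1) for j in minima])
    r, NEG = len(pts), float("-inf")
    # tail[t][p]: best value of terms t..n using only the points pts[p:]
    tail = [[NEG] * (r + 1) for _ in range(n + 2)]
    tail[n + 1] = [0.0] * (r + 1)
    for t in range(n, 0, -1):
        sign = 1 if t % 2 else -1
        for p in range(r - 1, -1, -1):
            j, kind = pts[p]
            take = NEG
            if kind == sign and tail[t + 1][p + 1] > NEG:
                take = sign * G[j] + tail[t + 1][p + 1]
            tail[t][p] = max(tail[t][p + 1], take)  # skip pts[p] or use it
    if tail[1][0] == NEG:
        return None
    # Trace back the leftmost optimal choice for each term.
    chosen, p = [], 0
    for t in range(1, n + 1):
        sign = 1 if t % 2 else -1
        while True:
            j, kind = pts[p]
            usable = kind == sign and tail[t + 1][p + 1] > NEG
            if usable and tail[t][p] == sign * G[j] + tail[t + 1][p + 1]:
                chosen.append(j)
                p += 1
                break
            p += 1
    return tail[1][0], chosen
```

Each table row is obtained from the previous one by a single sweep over the critical points, which is the sense in which the $n$-dimensional problem is reduced to a sequence of one dimensional optimizations.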

3.  Stopping  Criterion. 

In  this  section  we  prove  the  following: 

Theorem 2. Suppose that every (constrained) optimal configuration in the sequence
$\{S_1, S_2, \ldots\}$ is unique (i.e., for every $n$, if $S'_n \neq S_n$ and $S'_n \subset P$, then $U(n, S'_n) > U_n^*$)
and that for some $n$, $U^*_{n+2} > U^*_n$. Then, $U^*_{n+2k} > U^*_n$, for all $k \geq 1$.

This result will provide us with an efficient stopping criterion for the dynamic
programming recursions described in the previous section: since the first local
minima for the subsequences $\{U_1^*, U_3^*, \ldots\}$ and $\{U_2^*, U_4^*, \ldots\}$ are the global ones,
we can terminate the computations once we have found them.
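The stopping rule implied by Theorem 2 can be sketched as follows. This is only an illustration: `energy_of` is a hypothetical callable returning $U_n^*$ for a given $n$ (for instance, built on the dynamic programming recursions), and the odd and even subsequences are scanned separately, each one being abandoned at its first increase.

```python
def first_local_min(energy_of, start, n_max):
    """Scan n = start, start + 2, ... and stop at the first increase."""
    best_n, best_u = start, energy_of(start)
    n = start + 2
    while n <= n_max:
        u = energy_of(n)
        if u > best_u:        # by Theorem 2, later terms cannot be lower
            break
        best_n, best_u = n, u
        n += 2
    return best_n, best_u

def optimal_n(energy_of, n_max):
    """Best n over the odd and even subsequences, each scanned until its
    first local minimum."""
    candidates = [first_local_min(energy_of, s, n_max)
                  for s in (1, 2) if s <= n_max]
    return min(candidates, key=lambda c: c[1])
```

Only the energies up to two steps past each local minimum are ever requested, so the cost of the search is governed by $n^*$ rather than by $|P|$.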

To  prove  the  theorem,  we  will  need  the  following  lemmas: 

Lemma 1. Let $S_k = \{L_1, \ldots, L_k\}$ and $S_{k+2} = \{L'_1, \ldots, L'_{k+2}\}$ be optimal
boundaries (with corresponding configurations $f_k$ and $f_{k+2}$) for $n = k$ and
$n = k + 2$, respectively. Suppose that $k + 2 \leq |P|$. Then, $S_k \subset S_{k+2}$ (i.e., $S_{k+2}$ is a
refinement of $S_k$), provided $S_k$ is unique.

Proof: 

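We will assume that for some $j$, $L_j \in S_k - S_{k+2}$, and arrive at a contradiction.
We consider three cases:

Case 1: Suppose that for some $i$,

$$[L'_i, L'_{i+1}] \cap S_k = \emptyset.$$

In this case, we claim that we can find some index $p$ such that

$$[L'_p, L'_{p+1}] \cap S_k = \emptyset$$

and

$$f_{k+2}((L'_p, L'_{p+1}]) \neq f_k((L'_p, L'_{p+1}])$$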

Suppose  that  this  is  not  the  case.  Then.  L\, 7,^,  are  the  only  elements  of  Sk+ 2 
in  some  interval  (Ly,  LJ+,)  (or  in  one  of  the  extreme  intervals  [l,Li),(Lk,  N J)  and 

fk+ 2((^»»  ^»+il)  7^  /*((T,»  T,'-(-iJ) 

Suppose 

C(Ly,Ly+1) 

By  condition  (13),  we  have  that  L}  ^  £•_,  (otherwise.  L3  would  be  a  local  maximum 
and  minimum  of  G  at  the  same  time).  But  then,  since  Sk  is  optimal,  we  can  find 
a  configuration  with  A:  +  2  boundaries  whose  energy  is  lower  than  that  of  Sk+ 2. 
by  moving  L\  to  L ,  (or  L-+1  to  Lj+i),  which  contradicts  the  optimality  of  Sk+2-  A 
similar  argument  holds  if 

C  (l,Li)  or  (Lk,N] 

This  proves  our  claim. 

So, suppose that

$$[L'_p, L'_{p+1}] \cap S_k = \emptyset$$

and

$$f_{k+2}((L'_p, L'_{p+1}]) \neq f_k((L'_p, L'_{p+1}]).$$


Form 

Sk  =  {Li, . . Lp_i,  Lp+2,  •  •  • t  L>k+ 2} 

and  let  f'k  be  the  corresponding  configuration,  chosen  in  such  a  way  that  /*(l)  = 
fk{  1)  (and  therefore,  f'k([Lp,Lp+  j])  =  fk{[Lp,Lp+l})). 

Let  At/  be  the  change  in  U  (see  eq.  (4))  associated  with  setting: 


f([L'P,  Lp+  ll)  —  fk+2([I'p>Lp+l]). 


We  have  that 

0{Skr2)  =  U{S'k)  +  AU. 

Now,  we  put: 

$k+ 2  =  {Li,  ■  ■  ■$  Lj>Lp,  Lp+l, . .  .,Lk). 

Since  Sk  is  optimal,  we  have  that: 

U(Sk+2)  =  U(S'k)  +  A U  >  U{Sk)  +  A £/  =  Cl(Sk+2), 
which  contradicts  the  optimality  of  Sk+2- 

Case 2:

$$([1, L'_1] \cup [L'_{k+2}, N]) \cap S_k = \emptyset$$

Suppose  that  L\  €  [1,  L[).  We  must  have 

fk+2([l,  L\))  /*([!,  L\)) 


Otherwise,  if  L\  —  L2,  condition  (13)  generates  a  contradiction;  if  L\  >  L^,  we 
are  in  case  1,  and  if  L\  <  L^,  Sk+2  is  not  optimal,  since  we  get  a  lower  energy 
configuration  by  moving  L\  to  L\. 

So, 


By  a  similar  argument,  we  get  that 


fk+2({Lk+2,N])^fk{{L'k+2,N}). 


Now,  proceeding  as  in  case  1,  we  form: 

$k  —  {^2i  ••  •)Lk+ 1) 


and  let  f  \  be  the  corresponding  configuration,  chosen  in  such  a  way  that  f'k(  1)  = 


Let  A 0  be  the  change  in  U  associated  with  setting: 

/([^i ,  ^2))  = //t+2([^i>  ^2]) 

/([^k  +  li^*+2l)  =  fk+2{[Lk+\,Lk+2\) 


so  that 

U(St+2)  =  C/(5l)  +  A  0. 


Now,  we  form: 


Sk+2  —  {Li,  L\, . . Lk,  Lk+2}t 


Since  Sk  is  optimal,  we  have  that: 

f>(S*+2)  =  U(S'k)  +  A U>  U(Sk)  +  A U  =  U[S’k^), 
which  again  contradicts  the  optimality  of  Sk+2. 


Case 3:

For all $i$, $[L'_i, L'_{i+1}] \cap S_k \neq \emptyset$,
and $([1, L'_1] \cup [L'_{k+2}, N]) \cap S_k \neq \emptyset$ (*)

To make (*) hold, we must be able to place $k$ boundaries in $k + 3$ (overlapping)
closed intervals, without omitting any interval. Moreover, since condition (13) must
hold, we cannot put $L_j = L'_i$ and $L_{j+1} = L'_{i+2}$ for any $i, j$. But this is impossible;
so, our proof is finished. ∎

Lemma 2. Let $\Delta U_k = \tilde{U}(S_k) - \tilde{U}(S_{k+2})$. Then, $\Delta U_k \leq \Delta U_{k-2}$, for all $k \in [3, |P| - 2]$.


Proof: 

Consider  the  optimal  configurations  Sk,Sk+2,  Sk+4,  and  suppose  that  A Uk+2  > 
A Uk.  Using  lemma  1,  let 


Sk  —  •  •  •>  Lk}i 

Sk+ 2  =  {^ii  •  •  •.  Llt  L>2, . . . ,Lk }. 


By  condition  (13)  and  lemma  1,  there  are  only  two  valid  forms  for  Sk 
consider  each  case  separately: 

Case  1:  Sk+A  is  of  the  form: 

St +4  —  {//i, .  • Lp,  Llt  ly,  Lp+\. . .,  Lj,  I/?, . . .} 

(i.e.,  the  refinements  corresponding  to  Sk+ 2  and  Sk+^  are  disjoint). 

Then,  for 

S ic+2  —  {^1,  •  •  •,  Lp,  Lp+i, . . Ll,L2, . . .}, 

we  have 

U(S’k+2 )  =  U(Sk)  -  AUU2  <  U(Sk)  -  A Uk  =  a(St+2), 
which  is  a  contradiction. 


Case  2:  St+4  is  of  the  form: 


Sk+i  —  {L\, . . .,  Lj,  Llt  Llt  Lj, . . .} 


(i.e.,  St+4  is  a  subrefinement  of  the  refinement  introduced  by  St+2). 
Let 

a  =  —U({Li, . . Lj,  L1,Ll,  Lj+\, . . .}  +  U(Sk) 
b  =  U[{LU . . .,  Lj,  L”,  1^,  Lj+i, U(Sk) 
c  =  —  U{{L\, . . .,  Lj,  L2,  L},  Lj+i, . . .}  +  U(Sk) 


We  have  that. 


AUk  =  a  +  c  —  b 
A  Uk+2  =  b. 


By  assumption, 


6  >  a  +  e  —  b 


and  therefore, 

.  "4*  c  «  4 

At/*  =  a  +  c  —  b  <  — - —  <  max(a,  cj. 

Now,  let  S'k+ 2  be  formed  from  S*  by  the  refinement: 

(L’i.L’i,  if  a  =  max(a,c) 

1^2, 1^2,  ifc  =  max(a,e) 

Then, 

f>(5;+2)  =  U(Sk)  -  max(a,  c)  <  U(St)  -  A Uk  =  U(St+2), 
which  is  a  contradiction.! 

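Now we prove theorem 2:

Suppose $U^*_{k+2} > U^*_k$. Then,

$$k + 2 + \frac{\alpha}{2}\tilde{U}(S_{k+2}) > k + \frac{\alpha}{2}\tilde{U}(S_k),$$

that is, $\frac{\alpha}{2}\Delta U_k < 2$. Now, by lemma 2 we have:

$$U^*_{k+4} = k + 4 + \frac{\alpha}{2}\tilde{U}(S_{k+4}) = k + 4 + \frac{\alpha}{2}\left(\tilde{U}(S_{k+2}) - \Delta U_{k+2}\right) \geq$$

$$\geq k + 4 + \frac{\alpha}{2}\left(\tilde{U}(S_{k+2}) - \Delta U_k\right) > k + 2 + \frac{\alpha}{2}\tilde{U}(S_{k+2}) = U^*_{k+2}. \quad ∎$$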

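4. Expected Value of $n^*$.

First, we compute the (prior) probability density function $p(n)$ for the number
$n$ of odd bonds in the original field $f$.

Let $N_b = N - 1$ be the total number of bonds. We can rewrite equation (1)
as:

$$P(F = f) = \frac{1}{Z}\, e^{\frac{1}{\alpha}(N_b - 2n)} \qquad (25)$$

The total number of configurations compatible with a given $n$ is $2\binom{N_b}{n}$, and therefore

$$p(n) = \frac{2\binom{N_b}{n}\exp\left[\frac{1}{\alpha}(N_b - 2n)\right]}{\sum_{k=0}^{N_b} 2\binom{N_b}{k}\exp\left[\frac{1}{\alpha}(N_b - 2k)\right]} = \binom{N_b}{n}\left(\frac{e^{-1/\alpha}}{e^{1/\alpha} + e^{-1/\alpha}}\right)^{n}\left(\frac{e^{1/\alpha}}{e^{1/\alpha} + e^{-1/\alpha}}\right)^{N_b - n}$$

which is a binomial distribution. Therefore,

$$E[n] = N_b\left(\frac{e^{-1/\alpha}}{e^{1/\alpha} + e^{-1/\alpha}}\right)$$

$$\mathrm{Var}[n] = N_b\left(\frac{1}{(e^{1/\alpha} + e^{-1/\alpha})^2}\right)$$

We note that as $\alpha \uparrow \infty$, $E[n] \uparrow N_b/2$, and as $\alpha \downarrow 0$, $E[n] \downarrow 0$ (and $\mathrm{Var}[n] \downarrow 0$)
exponentially fast. This means that if the natural temperature of the system is not
too high, we can expect $n^*$, the MAP estimate for $n$, to be relatively small.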
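The binomial form of $p(n)$ is easy to check numerically. The sketch below is an illustration with hypothetical function names; it writes the success probability as $q/(1+q)$ with $q = e^{-2/\alpha}$, which equals $e^{-1/\alpha}/(e^{1/\alpha} + e^{-1/\alpha})$ but avoids overflow for small $\alpha$. The direct evaluation in `boundary_count_pmf` is intended only for moderate values of $\alpha$.

```python
import math

def boundary_count_stats(N, alpha):
    """Mean and variance of the number n of odd bonds for a chain of N sites
    under the prior P(f) proportional to exp[(N_b - 2n)/alpha], N_b = N - 1."""
    Nb = N - 1
    q = math.exp(-2.0 / alpha)
    p = q / (1.0 + q)          # probability that a given bond is odd
    return Nb * p, Nb * p * (1.0 - p)

def boundary_count_pmf(N, alpha):
    """p(n) for n = 0..N_b, computed directly from the configuration counts."""
    Nb = N - 1
    w = [2 * math.comb(Nb, n) * math.exp((Nb - 2 * n) / alpha)
         for n in range(Nb + 1)]
    Z = sum(w)
    return [x / Z for x in w]
```

For example, with $N = 11$ the expected number of boundaries approaches $N_b/2 = 5$ when $\alpha$ is very large and becomes negligible when $\alpha$ is small.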

5.  Relation  to  Multiscale  Filtering. 


An interesting characteristic of the DP formulation is that the solutions to
each of the subproblems (which in fact correspond to a minimization of $\tilde{U}$ (eq.
(4))) are independent of the value of the parameter $\alpha$. The role of this parameter
is to determine the number of regions ($n^*$) that will be present in the optimal
configuration. In this sense, it can be regarded as a "scale" parameter that controls
the aggregation of the subregions into larger units, and the algorithm can be used to
produce multiscale descriptions (in the style of the "fingerprints" treated in Witkin
(1983) and Yuille and Poggio (1983)) of the input signals. (Several other heuristic
solutions to this problem have been proposed. See, for example, Blumenthal et al.,
1977; Prazdny, 1982 and Pavlidis, 1973.)

If we interpret the algorithm in this way, it becomes natural to ask whether
a family of linear operators can do the same job in a much more efficient way. Let us
formulate this question in more precise form (in what follows, we will consider a
"continuous time" problem obtained from the original one as a limit when $N \uparrow \infty$
(provided that the observations are different from 0 only in a finite interval), since
it simplifies the notation. It should be clear that the same arguments apply to the
discrete case).


Consider a family of filters $\{F_L\}$ with the following properties:

(i) Each $F_L(x)$ is a symmetric and non-negative function of $x$.

(ii) For each $L$, $F_L(x)$ is a decreasing function of $|x|$, and $F_L(x) \downarrow 0$ as $|x| \uparrow \infty$
fast enough, so that $F_L$ can be approximated by a function with finite
support.

(iii) All the filters are normalized:

$$\int_{-\infty}^{\infty} F_L(x)\,dx = 1, \quad \text{for all } L.$$

(iv) The filters become sharper as $L \downarrow 0$:

$$\int_0^a F_{L_1}(x)\,dx < \int_0^a F_{L_2}(x)\,dx$$

implies that $L_1 > L_2$.

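Particular examples of acceptable families are:

(i) The family of rectangular boxes $B_L$:

$$B_L(x) = \begin{cases} \frac{1}{2L}, & \text{if } |x| < L \\ 0, & \text{otherwise} \end{cases}$$

(ii) The family of Gaussian kernels:

$$G_L(x) = \frac{1}{\sqrt{2\pi}\,L}\exp\left[-\frac{x^2}{2L^2}\right]$$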

Suppose we convolve the function $g(x) - \frac{1}{2}$ ($g(x)$ is a continuous time
approximation to the observations) with a set of filters from the family $\{F_L\}$.
If we start with $L$ large enough, the function

$$h_L = \left(g - \tfrac{1}{2}\right) * F_L$$

will be practically constant, and therefore, it will have no zeroes. As we decrease $L$,
zero crossings of $h_L$ will begin to appear. To each of these zero crossings, we will
associate a boundary, and form the configurations $S'_1, S'_2, \ldots$ with $1, 2, \ldots$ boundaries
respectively, that correspond to the first, first two, etc. zero crossings of $h_L$ (we are
ignoring, at this point, the question of the precise localization of these boundaries.
With additional constraints on the family $\{F_L\}$, it is possible, in principle, to localize
them by decreasing $L$ in a continuous fashion, and then tracing the position of
each zero crossing to the finest ($L = 0$) level; see Yuille and Poggio (1983). For
the moment, let us assume that we can identify the zero crossings of $g - \frac{1}{2}$ that
correspond to those of $h_L$, for all $L$).
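The appearance of zero crossings as the scale decreases can be illustrated with a discrete version of the box family. The sketch below is a hypothetical discretization, not part of the construction above: the half-width `w` (in samples) plays the role of $L$, the window always has $2w + 1$ samples, and $g$ is taken to be 0 outside the recorded interval.

```python
def box_filtered(g, w):
    """h[i] = average of (g - 1/2) over the window i-w..i+w, with g = 0
    outside the recorded samples."""
    n = len(g)
    out = []
    for i in range(n):
        ones = sum(g[j] for j in range(max(0, i - w), min(n, i + w + 1)))
        out.append(ones / (2 * w + 1) - 0.5)
    return out

def zero_crossings(h):
    """Indices i such that h changes sign between samples i and i + 1."""
    return [i for i in range(len(h) - 1) if h[i] * h[i + 1] < 0]
```

For a single box of 20 ones, no zero crossing is present while the window is longer than twice the box (for instance `w = 25`), and the two crossings that bound the box appear once the window is shorter than that (for instance `w = 19`).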

The question that we ask is the following:

If $S_1, S_2, \ldots$ are the optimal boundary configurations produced by the DP
algorithm, is it true that

$$S'_k = S_k$$

for all $k$?

As we now show, this is not the case.

Consider the signal $g(x)$ defined by:

$$g(x) = 1,$$

for $x \in [l_1, l_1 + 2a] \cup [l_2, l_2 + 2b] \cup [l_2 + 4b, l_2 + 6b] \cup [l_2 + 8b, l_2 + 10b] \cup [l_2 + 12b, l_2 + 14b] \cup [l_2 + 16b, l_2 + 18b]$,

and $g(x) = 0$, otherwise. Here, $l_1, l_2, a$ and $b$ are some positive numbers chosen in
such a way that, if $L_0$ is the starting $L$, we take $l_2 - l_1 - 2a \gg L_0$, so that, by
property (ii), there is no interaction between $[l_1, l_1 + 2a]$ and $[l_2, l_2 + 18b]$ (see figure
4.B.1).

Suppose that the zero crossings corresponding to $[l_1, l_1 + 2a]$ appear first (as a
single double zero) at $L = L_1$, and those corresponding to $[l_2, l_2 + 18b]$ at $L = L_2$.
Then,

$$\int_0^a F_{L_1}(x)\,dx = \int_a^\infty F_{L_1}(x)\,dx \qquad (28)$$

$$\int_0^b F_{L_2}(x)\,dx + \int_{3b}^{5b} F_{L_2}(x)\,dx + \int_{7b}^{9b} F_{L_2}(x)\,dx = \int_b^{3b} F_{L_2}(x)\,dx + \int_{5b}^{7b} F_{L_2}(x)\,dx + \int_{9b}^{\infty} F_{L_2}(x)\,dx \qquad (29)$$

Now, for $a > b$, we have:

$$\tilde{U}(\{l_1, l_2\}) = 10b$$

$$\tilde{U}(\{l_3, l_4\}) = 8b + 2a > \tilde{U}(\{l_1, l_2\})$$

and therefore, $S_2 = \{l_1, l_2\}$.

We claim that we can find some $a, b$ with $a > b$ such that

$$\int_0^a F_{L_2}(x)\,dx < \int_a^\infty F_{L_2}(x)\,dx$$

If this is true, we find, using (28) and conditions (iii) and (iv), that it implies that
$L_2 > L_1$, and therefore, $S'_2 = \{l_3, l_4\}$.

We now prove our claim:

Let  a  =  b  +  j ,  where  we  choose  e  so  that 


(30) 


(property  (ii)  guarantees  that  we  can  find  such  e).  From  (29), 

foo  rb  e5b  rQb 

Jb  FLt{x)dx  =  JQ  Fu(x)dx  +  2  /36  FLt[x)dx  +  2  Jn  FLj[x)dx 


and  from  (30), 

roo  roo  r  oo  rb+(/2 

L  F^x) dx  =  L,/2F^i'>dz  =  h  n,(x) dx  = 

rb  ^6+(/2  ^96  /»6-fc/2  /96 

=  Jo  n.(*)<i*+y,  n,(x)<b+2  Jn  F^[x)dx  =  Ja  FL,(x)dx+ 2  Jn  FL,{,x)dx  > 

r9b  r<* 

>  Jn  FL^dx  =  Jo  F^(x)dx  ■ 


This  result  does  not  mean,  of  course,  that  families  of  linear  filters  cannot  be 
used  for  producing  useful  multiscale  descriptions  of  signals;  it  only  means  that  these 
descriptions  cannot,  in  general,  be  considered  as  MAP  estimates  of  MRF  models. 


6.  Continuous  Valued  Fields. 


In  this  section  we  present  a  related  problem  which  can,  in  principle,  be  solved 
using  the  DP  approach,  although,  as  we  will  see,  in  a  less  efficient  way. 


Let  us  consider  the  problem  of  estimating  a  piecewise  constant  signal  corrupted 
by  additive  white  Gaussian  noise.  We  model  the  signal  {/,}  as  a  MRF  with  potential 


V(/t>  fi  +  \ 


if  fi  —  fi+1 
otherwise 


(31)

and global states distributed according to:

$$P(F = f) = \frac{1}{Z}\exp\left[-\frac{1}{\alpha}\sum_{i=1}^{N-1} V(f_i, f_{i+1})\right]$$

The observations are given by:

$$g_i = f_i + n_i$$

where $\{n_i\}$ is a white Gaussian process. The Bayesian (MAP) estimate for $f$ is again
found by minimizing eq. (4):

$$\hat{U}(f) = n + \gamma\,\tilde{U}(f)$$

$$\tilde{U}(f) = \sum_i (f_i - g_i)^2$$

where $n$ is the number of places where $f_i \neq f_{i+1}$, and $\gamma$ is a positive constant
determined by $\alpha$ and the noise variance. Note that in this
case, $f_i$ is not restricted to $\{0, 1\}$, but can take any real value.

Proceeding as we did in section 2, we consider the sequence of subproblems
obtained by putting $n = 0, 1, 2, \ldots$.

For any fixed $n$, $\tilde{U}$ will depend only on the $n$ integer variables that correspond
to the location of the boundaries between regions of constant $f$, since given these
boundaries $L = \{L_1, \ldots, L_n\}$, the optimal estimate for $f$ on any interval $(L_i, L_{i+1}]$
(we put $L_0 = 0$ and $L_{n+1} = N$) is:

$$\hat{f}((L_i, L_{i+1}]) = \frac{1}{L_{i+1} - L_i}\sum_{j=L_i+1}^{L_{i+1}} g_j$$

If we define $G_{k,l}$ (for $k < l$) as:

$$G_{k,l} = -\frac{1}{l - k}\Big(\sum_{j=k+1}^{l} g_j\Big)^2 \qquad (32)$$

We get that:

$$\tilde{U}(L) = \sum_{i=1}^{N} g_i^2 + \sum_{j=1}^{n+1} G_{L_{j-1}, L_j} \qquad (33)$$

(note that $\sum_i g_i^2$ is a constant for a given set of observations). Using dynamic
programming principles, we can now write the recursions:

$$F_0(k) = G_{k,N}, \quad k = 0, \ldots, N - 1$$

$$F_{j+1}(k) = \inf_{i > k}\{G_{k,i} + F_j(i)\}, \quad k = 0, \ldots, N - j - 1$$

$$L_{j+1}(k) = \{L : G_{k,L} + F_j(L) = F_{j+1}(k)\} \qquad (34)$$

The optimal solution, for each given $n$, is:

$$S_n = \{L_n(0),\, L_{n-1}(L_n(0)),\, \ldots,\, L_1(L_2(\ldots(L_n(0))\ldots))\}$$

and the corresponding energy,

$$\hat{U}(n, S_n) = n + \gamma\Big[\sum_{j=1}^{N} g_j^2 + F_n(0)\Big] \qquad (35)$$

The solution to our problem will be $S_{n^*}$, where:

$$\hat{U}(n^*, S_{n^*}) = \inf_n\{\hat{U}(n, S_n)\} \qquad (36)$$

Unfortunately, in this case we cannot guarantee the unimodality of any subsequence
of $\{\hat{U}(S_n)\}$ (although we believe that the sequence will be unimodal in many cases)
and so, (36) has to be computed, in principle, by an exhaustive one dimensional
search. Another unpleasantness is that, unlike the binary case, the search space for
the variables $L_i$ cannot be reduced in any obvious way.
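The recursions of this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, the segment cost used here is the full squared residual of the segment (so it already contains the constant $\sum_i g_i^2$), `gamma` is passed in as a free parameter, and the optimal number of boundaries is found by the exhaustive search described above.

```python
def make_cost(g):
    """cost(k, l) = sum over sites k+1..l of (g_j - segment mean)^2."""
    s1, s2 = [0.0], [0.0]
    for x in g:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)
    def cost(k, l):
        s = s1[l] - s1[k]
        return (s2[l] - s2[k]) - s * s / (l - k)
    return cost

def best_segmentation(g, n):
    """Least squared residual with exactly n boundaries (0 <= n <= N - 1).
    Returns (residual, boundaries); a boundary L separates sites L and L+1."""
    N, cost, INF = len(g), make_cost(g), float("inf")
    # F[j][k]: best residual for sites k+1..N using exactly j boundaries
    F = [[INF] * (N + 1) for _ in range(n + 1)]
    arg = [[None] * (N + 1) for _ in range(n + 1)]
    for k in range(N):
        F[0][k] = cost(k, N)
    for j in range(1, n + 1):
        for k in range(N - j):
            for i in range(k + 1, N - j + 1):
                v = cost(k, i) + F[j - 1][i]
                if v < F[j][k]:
                    F[j][k], arg[j][k] = v, i
    bounds, k = [], 0
    for j in range(n, 0, -1):   # follow the stored minimizers from site 0
        k = arg[j][k]
        bounds.append(k)
    return F[n][0], bounds

def map_segmentation(g, gamma):
    """Exhaustive search over n of n + gamma * residual."""
    best = None
    for n in range(len(g)):
        r, b = best_segmentation(g, n)
        u = n + gamma * r
        if best is None or u < best[2]:
            best = (n, b, u)
    return best
```

For the observations `[0, 0, 0, 5, 5, 5, 1, 1]` and `gamma = 1.0`, two boundaries (after sites 3 and 6) give a zero residual, and the search selects $n^* = 2$.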


Appendix  4.C 


CONSISTENCY  CONDITION  FOR  THE  MPM  ESTIMATOR 


In this appendix we present a proof for the consistency condition (given
by equation (11) of chapter 4) satisfied by the MPM estimator of a binary,
two-dimensional Ising net:

Theorem: Let $P(f, g)$ be the posterior distribution corresponding to the estimation
of the first order, binary MRF $f$ from the observations $g$ which are obtained as the
output of a binary symmetric channel:

$$P(f, g) = \frac{1}{Z}\exp\Big[-\sum_{i,j} V(f_i, f_j) - \gamma\sum_i \left(1 - \delta(f_i - g_i)\right)\Big]$$

Let $\hat{f}$ be the MPM estimator for $f$. Then, for every site $i$,

$$\sum_{f: f_i = \hat{f}_i} P(f, g) > \sum_{f: f_i \neq \hat{f}_i} P(f, g)$$

implies that:

$$\sum_{f: f_i = \hat{f}_i} P(f, \hat{f}) > \sum_{f: f_i \neq \hat{f}_i} P(f, \hat{f})$$


1)  We  first  prove  that  for  all  i : 


/  / 


implies  that: 

E  p(/w,9w)  >  E'V'V1)  : 

/  / 

Suppose  that  g  ^  (otherwise,  the  above  is  obviously  true).  For  any  fixed  /, 
have  that: 


and 


=  ^ {exP[  £  /y)  -  7]  -  «p[  £  V(1  -  />)]} 

yeAT(  ;6  ‘V, 


P(/W,9W)-P(»>W.»W)  = 


=  K{exp[-  ^(/.'./y)l  -e*p[  E  -  /.■/,)  -  ll) 
ye*.  ye*. 


Where  K  is  a  constant.  Since  7  >  0,  this  implies  that: 

P{f('\  9)  -  P(hM,  g)  <  P(f{'K  9{i))  ~  P(h{i),  m 


so  that 


EPt/f'l.jW)  _/>(*(•!, ,(,))  >  £P(/W,  j)-P(ftl'U)  >  0 
/  / 


2)  Let  r,  =  1  -  /,.  We  now  prove  that  if: 

£  P(f,g)>  £  lUri 


then. 


E  p(/.jm)>  E  p(/.»w) 


For  t  =  j,  part  (1)  applies,  and  for  gW  =  g,  the  assertion  is  obviously  true,  so 
suppose  i  ^  j  and  g^  ^  g.  We  have: 

=  A-,{exp(-  £  V(/„/j)l-'x p|£  ni-/../y)-ll) 

j'G/V,  jeNt 

n/,,|.9W)-mW.JM)  =  «-i{exp[-  £  V(/,,/,)|- 

](zNi 

exP[  E  F(x  “  /.•»  4)  -  7]}  exp[-7(l  -  2(4  -  gv))2]  > 
>e-\P{fM,g)-P(hM,g) 
for  some  constant  K 1,  so  that 

E  9(i))  >  e"1  E  P(/(,)-  ?)  ~  9)  >  0 

/  / 

The theorem is now proved by assuming that

$$\sum_{f: f_i = \hat{f}_i} P(f, g) > \sum_{f: f_i \neq \hat{f}_i} P(f, g)$$

and successively replacing $g_i$ by $\hat{f}_i$ for $i = 1, 2, \ldots$ and using (1) and (2) to show that
the corresponding inequalities hold at each step.


Chapter 5


RECONSTRUCTION  OF  PIECEWISE  CONTINUOUS  FUNCTIONS 


1.  Introduction. 

In  this  chapter  we  will  illustrate  the  application  of  local  spatial  interaction 
models  and  estimation  techniques  that  we  have  described  to  the  solution  of  the 
general  reconstruction  problem  that  we  introduced  in  chapter  1.  To  make  this 
discussion  more  specific,  we  will  consider  a  particular  instance  of  this  problem:  the 
reconstruction  of  piecewise  continuous  functions  from  noisy  observations  taken  at 
sparse  locations. 

In  this  reconstruction,  it  will  be  important  not  only  to  interpolate  smooth 
patches  over  uniform  regions,  but  to  locate  and  preserve  the  discontinuities  that 
bound  these  regions,  since  very  often  they  are  the  most  important  parts  of  the 
function.  They  may  represent  object  boundaries  in  vision  problems  (such  as  image 
segmentation;  depth  from  stereo;  shape  from  shading;  structure  from  motion,  etc.); 
geological  faults  in  geophysical  information  processing,  etc. 

The most successful approaches to this problem (see Terzopoulos (1984)) consist
of, first, interpolating an everywhere smooth function over the whole domain; then,
applying some kind of discontinuity detector (followed by a thresholding operation)
to try to find the significant boundaries; and finally, re-interpolating smooth patches
over the continuous subregions.

The results that have been obtained with this technique, however, are not
completely satisfactory. The main problem is that the task of the discontinuity detector
is hindered by the previous smooth interpolation operation. This becomes critical
when the observations are sparsely located, since in this case, the discontinuities
may be smeared in the interpolation phase to such a degree that it may become
impossible to recover them in the detection phase.

One way around this difficulty is to perform the boundary detection and
interpolation tasks at the same time. In the method we will present, this is done
by  using  a  Bayesian  approach,  and  including  in  the  posterior  distribution  our  prior 
knowledge  about  the  smoothness  of  the  function  and  about  the  geometry  of  the 
discontinuities,  as  well  as  the  information  provided  by  the  observations.  Before 
describing  how  this  is  done,  let  us  formulate  the  problem  in  a  more  precise  way. 

Consider a region $\Omega$ of the plane which is formed by a number of subregions
separated by boundaries which are known to be piecewise smooth curves. Suppose
that within each of these subregions, some property $f$ (in what follows, we will refer
to $f$ as "depth") varies in a smooth fashion, presenting, at the same time, abrupt
jumps across most of the boundaries. Suppose also that we have measurements for
the values of $f$ at some discrete set of sites $S$; these measurements will, in general,
be corrupted by some form of noise.

Our problem is then to estimate the values of $f$ on some finite lattice of points
$L \subset \Omega$, and to find the position of the boundaries, using all the available information
in an optimal way.

2.  Posterior  Distribution. 

To apply the general reconstruction algorithms developed in chapter 3 to this
problem, we need to cast it in probabilistic terms. The main issue here is the
representation of the concept of "piecewise continuity" in the form of a prior Gibbs
distribution in a meaningful way.

This could be done, for example, by modeling the function as a first order,
continuous valued MRF with nearest neighbor potentials given by:

$$V(f_i, f_j) = \begin{cases} (f_i - f_j)^2, & \text{if } |f_i - f_j| < a \text{ and } |i - j| = 1 \\ b, & \text{if } |f_i - f_j| > a \text{ and } |i - j| = 1 \\ 0, & \text{otherwise} \end{cases}$$

where $a$ and $b$ are positive constants such that $b > a^2$, and for every pair of
neighboring sites $i, j$, $|f_i - f_j| < a$ if both $i$ and $j$ lie in the same smooth patch,
and $|f_i - f_j| > a$, otherwise.
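For a one-dimensional chain, the prior energy induced by this potential can be sketched as follows. The function names are hypothetical, and treating the borderline case $|f_i - f_j| = a$ as a discontinuity is an assumption of this sketch, since the potential above leaves that case open.

```python
def pair_potential(fi, fj, a, b):
    """Potential for two neighboring sites: quadratic within a smooth patch,
    constant penalty b across a jump (the case |fi - fj| == a is treated as
    a jump, which is an assumption)."""
    d = abs(fi - fj)
    return d * d if d < a else b

def prior_energy(f, a, b):
    """Sum of the nearest-neighbor potentials along a 1-D chain of values."""
    return sum(pair_potential(f[i], f[i + 1], a, b) for i in range(len(f) - 1))
```

With `a = 1.0` and `b = 2.0`, the chain `[0.0, 0.5, 3.0]` has energy `0.25 + 2.0`: the first pair is penalized quadratically and the second pays the fixed cost of a discontinuity.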


This  scheme,  however,  has  the  disadvantage  of  not  allowing  for  the  explicit 
modeling  of  prior  knowledge  about  the  geometry  of  the  curves  that  bound  the 
smooth  patches  (the  fact  that  they  should  be  piecewise  smooth  curves,  for  example). 
A  more  flexible  construction  involves  the  use  of  two  coupled  MRF  models:  one  to 
represent  the  function  (the  "surface")  itself,  and  another  to  model  the  curves  where 
the  field  is  discontinuous.  A  coupled  model  of  this  kind  was  first  used  by  Geman 
and  Geman  (1984)  in  the  context  of  the  restoration  of  piecewise  constant  images. 
We  will  now  describe  their  work  in  detail,  and  define  a  related  model  that  can  be 
used  for  our  problem. 


2.1.  Coupled  Line  and  Depth  Models. 


In  Geman  and  Geman’s  work,  the  intensity  of  the  images  is  modeled  using  a 
first  order  MRF  with  generalized  Ising  potentials  (see  chapter  4).  The  boundaries 
between  constant  regions  are  modeled  using  a  "line  process"  l,  which  is  a  MRF 
whose  associated  random  variables  are  located  at  the  sites  of  the  dual  lattice  of 
lines  that  connect  the  sites  of  the  original  intensity  lattice  (see  figure  12).  These 
variables  may  be  binary  (indicating  the  presence  or  absence  of  a  boundary  between 
two  pixels),  or  may  take  more  values  to  indicate  the  orientation  of  the  boundary  as 
well.  In  both  cases,  their  function  is  to  decouple  adjacent  pixels,  reducing  the  total 
energy  if  the  intensities  of  these  pixels  are  different. 

This is done by modifying the prior energy function; the new expression is:

$$U_P(f, l) = \sum_i \sum_j V_f(f_i, f_j, l_{ij}) + \sum_{C_l} V_{C_l}(l) \tag{1}$$

where

$$
V_f(f_i, f_j, l_{ij}) = \begin{cases}
0, & \text{if } l_{ij} \text{ is "on"} \\
V(f_i, f_j), & \text{otherwise}
\end{cases}
$$

$$
V(f_i, f_j) = \begin{cases}
1, & \text{if } |i - j| = 1 \text{ and } f_i \neq f_j \\
0, & \text{otherwise}
\end{cases}
$$

Figure 13. (a) Cliques for the line process used by Geman and Geman. (b) Additional cliques
used to prevent sharp turns.



Here $l_{ij}$ is the line element between sites $i$ and $j$, and the line potentials $V_{C_l}$ have as
supports cliques of size 4, such as the one shown in Fig. 13-a. Every line element
(except at the boundaries of the lattice) belongs to 2 such cliques. The values of
the potentials associated with each possible configuration of lines within a clique
must be specified. Thus, for example, if straight horizontal and vertical boundaries
are likely to be present, a binary process with potential values such as those of Fig.
14 is used (rotational invariance is assumed). In more general situations (such as
piecewise smooth boundaries), we may use different values for the potentials, or we
may allow more states for the line elements, corresponding to different orientations,
augmenting the table of values for the potentials accordingly.

2.2.  Models  for  Piecewise  Continuous  Functions.

The  model  we  have  described  can  be  adapted  to  our  problem  by  modifying 
the  choice  of  the  potentials  and  the  neighborhood  structure  of  the  coupled  MRF’s. 
Specifically,  the  following  modifications  are  needed: 

1. Since in our case the observations are sparse, it becomes necessary to expand
the size of the neighborhood of the line field, to prevent the formation of "thick"
boundaries between the smooth patches (i.e., adjacent, parallel segments of active
lines in these regions). In particular, we propose that the dual lattice be 8-connected,
with non-zero potentials for the cliques of the form illustrated in figure 13 (a)
and (b). The inclusion of the cliques of figure 13-b has the additional advantage
of penalizing the occurrence of sharp turns, permitting us to model the formation
of piecewise smooth boundaries (a more general case) using a binary line process
instead of the 4-valued process proposed by Geman and Geman. The potentials
for these cliques are computed in the following way:

Figure 14. Potentials for the different configurations of a line process.

Let $V_a$, $V_b$ denote the potentials associated with the cliques $C_a$, $C_b$ of figure 13
(a) and (b), respectively, and let $S_k$ ($k \in \{a, b\}$) denote the number of line elements
belonging to $C_k$ that are "on" at a given time, i.e.,

$$S_k = \sum_{i \in C_k} l_i, \quad k = a, b$$

The potentials $V_k$ are given by:

$$V_k = \beta \phi_k(S_k), \quad k = a, b \tag{2}$$

where $\beta$ is a constant, and the functions $\phi_k$ are defined by the following tables:
    $S$       1     2
    $\phi$    0    10

It is not difficult to see that this choice of potentials (notice that $V_a$ will be
slightly different from the definition of figure 14) will effectively discourage both
the formation of thick boundaries ($S_b = 2$) and the presence of sharp turns ($S_a = 3$
and/or $S_b = 2$).


2.  The  potentials  of  the  depth  process,  which  is  now  continuous-valued,  have  to  be 
modified  to  express  the  more  relaxed  condition  of  piecewise  continuity  (instead  of 
piecewise  constancy).  Specifically,  we  propose: 


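$$
V_f(f_i, f_j, l_{ij}) = \begin{cases}
(f_i - f_j)^2 (1 - l_{ij}), & \text{for } |i - j| = 1 \\
0, & \text{otherwise}
\end{cases} \tag{3}
$$

(note that $l_{ij} \in \{0, 1\}$)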

3.  Unlike  the  case  of  piecewise  constant  surfaces,  we  now  have  to  worry  about  the 
maximum  absolute  difference  in  the  values  of  two  adjacent  depth  sites  that  we  are 
willing  to  consider  as  a  "smooth"  gradient  (and  not  a  discontinuity).  This  value, 
which  in  general  is  problem-dependent,  determines  the  magnitude  of  the  constant 
$\beta$ in equation (2), which can be interpreted as the coupling strength between the
two  processes. 

2.3.  Model  for  the  Observations. 

We  will  adopt  the  general  model  described  in  section  2.1  of  chapter  3  to 
represent  the  observation  process.  In  particular,  to  make  the  discussion  more 
specific, we will assume that the observations $g$ correspond to samples of the surface
$f$ taken at a set $S \subset L$ of sparse locations, corrupted by a zero mean, white, additive
Gaussian noise process:

$$g_i = f_i + n_i$$

so that the conditional distribution is:

$$P_{g|f}(g \mid f) = \prod_{i \in S} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-(f_i - g_i)^2 / 2\sigma^2\right]$$

Our results, however, can be extended to handle other noise models as well.
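
The observation model can be illustrated with a short Python sketch. It is not part of the original text: the function names `sample_observations` and `log_likelihood`, the array `mask` (1 at observed sites, 0 elsewhere) and the use of NumPy are assumptions made only for this example.

```python
import numpy as np

def sample_observations(f, n_obs, sigma, rng):
    """Sample f at n_obs distinct random sites and add N(0, sigma^2) noise.

    Returns (g, mask): g holds the noisy samples (zero where unobserved),
    mask is 1.0 at observed sites and 0.0 elsewhere.
    """
    flat_idx = rng.choice(f.size, size=n_obs, replace=False)
    mask = np.zeros(f.size)
    mask[flat_idx] = 1.0
    mask = mask.reshape(f.shape)
    g = mask * (f + rng.normal(0.0, sigma, size=f.shape))
    return g, mask

def log_likelihood(f, g, mask, sigma):
    """Log of the Gaussian conditional density of g given f (observed sites only)."""
    n = mask.sum()
    sq = np.sum(mask * (f - g) ** 2)
    return -sq / (2.0 * sigma**2) - n * np.log(np.sqrt(2.0 * np.pi) * sigma)
```

For instance, `sample_observations(np.ones((30, 30)), 200, 0.1, np.random.default_rng(0))` produces 200 noisy samples of a flat surface on a 30 × 30 lattice.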
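Using Bayes' rule, we can finally write the posterior distribution as:

$$P_{f,l|g}(f, l; g) = \frac{1}{Z} \exp\left[-U_P(f, l; g)\right]$$

with

$$U_P(f, l; g) = \sum_{i,j} V_f(f_i, f_j, l_{ij}) + \frac{1}{2\sigma^2} \sum_{i \in S} (f_i - g_i)^2 + \sum_{C_a} V_a(l) + \sum_{C_b} V_b(l)$$
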

$V_a$ and $V_b$ are the potentials corresponding to the "a" and "b" type cliques of the
line process, and are defined by equation (2). It is convenient to introduce a function
$q$ which is equal to 1 only at those sites where there is an observation, and is equal
to zero elsewhere (i.e., $q$ is an indicator function of the set $S$):

$$
q_i = \begin{cases}
1, & \text{if } i \in S \\
0, & \text{otherwise}
\end{cases}
$$

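Using this function, and the definition of $V$ from equation (3), we get:

$$U_P(f, l; g) = \sum_{i,j} (f_i - f_j)^2 (1 - l_{ij}) + \frac{1}{2\sigma^2} \sum_{i \in L} (f_i - g_i)^2 q_i + \sum_{C_a} V_a(l) + \sum_{C_b} V_b(l) \tag{6}$$

where the first sum runs over the pairs of neighboring depth sites.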


3.  Optimality  Criterion. 


We  can  now  apply  the  general  principles  developed  in  chapter  3  to  derive  the 
optimal  Bayesian  estimators  for  the  depth  and  line  fields.  As  a  performance  criterion 
we  will  use  a  mixed  cost  functional  of  the  form: 

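$$e(f, l, \hat f, \hat l) = \sum_{i \in L_f} (f_i - \hat f_i)^2 + \sum_{j \in L_l} \left(1 - \delta(l_j - \hat l_j)\right)$$

where $L_f$, $L_l$ denote the depth and line lattices, respectively. This error criterion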
means  that  the  reconstructed  surface  should  be  as  close  as  possible  to  the  true 
(unknown)  surface,  and  that  we  should  commit  as  few  errors  as  possible  in  the 
assertions  about  the  presence  or  absence  of  discontinuities. 

Applying the results of section 5 of chapter 3, we find that the optimal
estimators will be the posterior mean for $f$ and the maximizer of the posterior
marginals for $l$. Note that these estimates must be computed by averaging over all
possible values of both $f$ and $l$:

$$\bar f_i = \sum_f \sum_l f_i \, P_{f,l|g}(f, l; g)$$

$$P_{l_j}(b) = \sum_f \sum_{l:\, l_j = b} P_{f,l|g}(f, l; g)$$
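
When these averages are approximated from a set of configurations drawn from the posterior, the two estimators reduce to simple sample statistics. The following Python sketch is only an illustration under that assumption; the function name and the array layout (one sample per row) are hypothetical.

```python
import numpy as np

def estimates_from_samples(f_samples, l_samples):
    """Approximate the posterior mean of f and the MPM estimate of a binary l.

    f_samples, l_samples: arrays whose first axis indexes the samples.
    For a binary field, maximizing the marginal amounts to thresholding
    the empirical frequency of the "on" state at 0.5.
    """
    f_hat = np.mean(f_samples, axis=0)
    p_on = np.mean(l_samples, axis=0)
    l_hat = (p_on > 0.5).astype(int)
    return f_hat, l_hat
```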

4.  Monte  Carlo  Algorithm. 

There  is  one  serious  difficulty  that  prevents  us  from  applying  directly  the 
general  Monte  Carlo  procedure  that  was  derived  in  chapter  3  to  the  computation 
of  these  optimal  estimates:  since  the  depth  variables  are  continuous-valued,  if  we 
discretize  them  finely  enough  to  guarantee  sufficient  precision  of  the  results,  the 
computational  complexity  of  either  the  Metropolis  or  Gibbs  Sampler  algorithms
will  be  very  large.  One  way  around  this  difficulty  is  to  note  that  for  any  fixed 
configuration  of  the  line  field,  the  posterior  energy  becomes  a  non-negative  definite 
quadratic  form: 

$$U(f \mid l) = \sum_{i,j:\, l_{ij} = 0} (f_i - f_j)^2 + \alpha \sum_{i \in S} (f_i - g_i)^2 + K \tag{8}$$


where $\alpha$ and $K$ are constants (note that the first sum is taken only over those pairs of
sites whose connecting line element is "off", and the second one over the set $S$). This
means that the posterior distribution of the depth field is conditionally Gaussian,
so that we can find the optimal conditional estimator $f^*_l$ as the minimizer of (8)
(for a Gaussian distribution, the posterior mean and the MAP estimate coincide).
If $l$ is identically zero (no lines), this function is strictly convex, and therefore it has
a unique minimum. Let $f^*_0$ be the corresponding global minimizer. For any fixed
configuration $l$, the gradient of (8) is given by:

$$\frac{\partial U}{\partial f_i} = 2 \sum_{j \in N_i} (f_i - f_j)\, t_{ij} + 2 \alpha q_i (f_i - g_i) \tag{9}$$

where

$$N_i = \{j : |i - j| = 1\}; \qquad t_{ij} = 1 - l_{ij}$$


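Setting this gradient equal to 0, we find that any minimizer of $U$ will be a fixed
point of the system:

$$
f_i^{(k+1)} = \frac{\sum_{j \in N_i} t_{ij} f_j^{(k)} + \alpha q_i g_i}{\sum_{j \in N_i} t_{ij} + \alpha q_i}
\quad \text{if } \sum_{j \in N_i} t_{ij} + \alpha q_i > 0,
\qquad \text{and } f_i^{(k+1)} = f_i^{(k)} \text{ otherwise} \tag{10}
$$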
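
A minimal sketch of one synchronous sweep of (10) on a rectangular 4-connected lattice is given below. The storage layout is an assumption made for this example only: `lh[r, c]` is the line element between sites `(r, c)` and `(r, c + 1)`, `lv[r, c]` the one between `(r, c)` and `(r + 1, c)`; the function names are likewise hypothetical.

```python
import numpy as np

def neighbor_sums(f, lh, lv):
    """Return (sum_j t_ij f_j, sum_j t_ij) over the 4 neighbors of every site."""
    th, tv = 1.0 - lh, 1.0 - lv
    num = np.zeros(f.shape)
    den = np.zeros(f.shape)
    num[:, :-1] += th * f[:, 1:];  den[:, :-1] += th   # right neighbor
    num[:, 1:] += th * f[:, :-1];  den[:, 1:] += th    # left neighbor
    num[:-1, :] += tv * f[1:, :];  den[:-1, :] += tv   # lower neighbor
    num[1:, :] += tv * f[:-1, :];  den[1:, :] += tv    # upper neighbor
    return num, den

def relax_step(f, lh, lv, g, q, alpha):
    """One synchronous application of the update rule (10)."""
    num, den = neighbor_sums(f, lh, lv)
    num = num + alpha * q * g
    den = den + alpha * q
    out = f.copy()
    ok = den > 0                     # isolated, unobserved sites keep their value
    out[ok] = num[ok] / den[ok]
    return out

def energy(f, lh, lv, g, q, alpha):
    """Conditional energy (8) without the constant K, each pair counted once."""
    smooth = (np.sum((1 - lh) * (f[:, :-1] - f[:, 1:]) ** 2)
              + np.sum((1 - lv) * (f[:-1, :] - f[1:, :]) ** 2))
    return smooth + alpha * np.sum(q * (f - g) ** 2)
```

Calling `relax_step` repeatedly and monitoring `energy` gives a direct numerical check that the iteration does not increase the conditional energy.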


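We note that the updating scheme (10) will produce a decrease in the value of
$U(f \mid l)$, regardless of the sweeping strategy. In a synchronous scheme (where all
the sites are updated at the same time), writing $\delta_i = f_i^{(k+1)} - f_i^{(k)}$ and taking
the sums over pairs as in (8), the energy increment will be:

$$
\begin{aligned}
\Delta U(f \mid l) &= U(f^{(k+1)} \mid l) - U(f^{(k)} \mid l) \\
&= -2 \sum_{i \in L} \Big( \sum_{j \in N_i} t_{ij} + \alpha q_i \Big) \delta_i^2 + \sum_{i,j:\, l_{ij} = 0} (\delta_i - \delta_j)^2 + \alpha \sum_{i \in S} \delta_i^2 \\
&= -\sum_{i,j:\, l_{ij} = 0} (\delta_i + \delta_j)^2 - \alpha \sum_{i \in S} \delta_i^2 \le 0
\end{aligned}
$$

because the last expression is the negative of a non-negative definite quadratic form.
For an asynchronous strategy, where $f^{(k+1)}$ is obtained from $f^{(k)}$ by updating only
the site $i$, we get:

$$\Delta U(f \mid l) = -\Big( \sum_{j \in N_i} t_{ij} + \alpha q_i \Big) \left(f_i^{(k+1)} - f_i^{(k)}\right)^2 \le 0$$

Therefore, if we set

$$f^{(0)} = f^*_0 \tag{11}$$

the dynamical system defined by (10) and (11) (with a given sweeping strategy) will
be stable and have a unique fixed point $f^*_l$.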

Note that, since $U(f \mid l)$ is always convex, $f^*_l$ will be a global minimizer (see
Luenberger (1973)), but in general it will not be the only one; there may be cases
in which some region $Q$ within which there are no observations is isolated from the
rest of the lattice by the line process. In this case, any solution for which

$$f_j = \text{constant}, \quad j \in Q$$

will also minimize $U(f \mid l)$. However, for a fixed initial state $f^{(0)}$ the deterministic
dynamical system (10) will always converge to the same solution, so that the
configuration $f^*_l$ is well defined.

Let us define the set $F^*$ as:

$$F^* = \{(f, l) : f = f^*_l\}$$

It is clear that, if $\hat f$, $\hat l$ are the optimal estimates for our problem, we have that:

$$(\hat f, \hat l) \in F^*$$

which suggests that we can constrain the search for the optimal estimators to this set.
This can be done, in principle, by replacing the posterior energy with the function:

$$U^*(l) = U_P(f^*_l, l)$$

(which depends only on $l$), and using the standard Monte Carlo procedures to find
the optimal estimator $\hat l$. To illustrate this idea, let us consider the following physical
model:

It is a well known fact that the steady state of an electrical network that contains
only (current or voltage) sources and linear resistors will be the global minimizer of
a quadratic functional that corresponds to the total power dissipated as heat (Oster
et al., 1971). It is therefore possible to construct an analog network that will find
the equilibrium state of the depth field for a given, fixed configuration of the line
process, i.e., that will minimize the conditional energy (8) (see Poggio and Koch,
1984). This suggests a hybrid computational scheme in which the line field (whose
state is updated digitally, using, say, the Metropolis or Gibbs Sampler algorithms)
acts as a set of switches on the connections between the nodes of the analog network
whose voltages represent the depth process. In particular, if $f_i$ represents the voltage
at node $i$, the hybrid network can be represented as a 4-connected lattice of nodes
(see figure 15) in which:

(i) A resistance (of unit magnitude) and a switch (controlled by the line
element $l_{ij}$) are present in every link between pairs $i, j$ of adjacent nodes.

(ii) If an observation $g_i$ is present at site $i$, a current of magnitude equal to
$\alpha g_i$ is injected into the corresponding node, which must also be connected
to a common ground via a resistance of magnitude $1/\alpha$ (see equation 8).

A direct application of Kirchhoff's current law shows that at each node of this
network we will have:

$$\sum_{j \in N_i} (f_i - f_j)(1 - l_{ij}) + \alpha q_i f_i = \alpha q_i g_i$$

which corresponds to a fixed point of the system (10). In practice, there will always
be parasitic capacitances which will prevent the instantaneous establishment of the
equilibrium conditions. However, the time constant of the analog portion of the
network may be made very small, so that in fact, the probability distribution of the
equilibrium states of this network will be Gibbsian with energy $U^*$.

This scheme can be used, in principle, to construct a special purpose hybrid
computer for the fast solution of problems of this type. In a digital machine, however,
the exact implementation of this strategy will in general be computationally very
expensive, since $f^*_l$ must be computed every time a line site is updated. We will
now present an approximation which has an excellent experimental performance,
and leads to an efficient implementation.

First, let us examine one iteration of, say, the Metropolis algorithm at a given
temperature $T > 0$ for the function $U^*$. When a line site is visited and its state is
updated, the corresponding increment in energy $\Delta U^*$ is computed as follows:

Suppose the line site $ij$ is visited (the line between depth sites $i$ and $j$). Let $l_{ij}$
be its current state, and $\tilde l_{ij}$ the candidate state:

$$\tilde l_{ij} = 1 - l_{ij}$$

Suppose that the current state of the depth process is

$$f = f^*_l$$

and let $f^*_{\tilde l}$ be the fixed point of (10) obtained when we replace $l_{ij}$ by $\tilde l_{ij}$. Let us
define:

$$\tilde f = f^*_{\tilde l}$$

and

$$\Delta V_{ij} = \sum_{C_a:\, l_{ij} \in C_a} \left[V_a(\tilde l) - V_a(l)\right] + \sum_{C_b:\, l_{ij} \in C_b} \left[V_b(\tilde l) - V_b(l)\right]$$

where $C_a$, $C_b$ are the "a" and "b" type cliques defined in figure 13, and $V_a$, $V_b$ the
corresponding potentials.

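Since the depth process is at equilibrium, and we are changing only the element
$l_{ij}$, we may assume that

$$\tilde f_p \approx f_p \quad \text{for } p \neq i, j \tag{12}$$

so that

$$
\Delta U^* \approx \Delta V_{ij} + \sum_{m = i,j} \Big\{ (\tilde f_m^2 - f_m^2) \Big[ \sum_{k \in N_m} (1 - \tilde l_{km}) + \alpha q_m \Big] - 2 (\tilde f_m - f_m) \Big[ \sum_{k \in N_m} f_k (1 - \tilde l_{km}) + \alpha q_m g_m \Big] \Big\} \tag{13}
$$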

Now, if the absolute difference $|f_i - f_j|$ is small, $f$ and $\tilde f$ will be practically
identical; on the other hand, if $|f_i - f_j|$ is large, the changes in $f$ at locations $i$ and
$j$ will be relatively small with respect to this absolute difference. Therefore, we may
approximate $\Delta U^*$ by the simple expression:

$$\Delta U^* \approx \Delta V_{ij} + (f_i - f_j)^2 (l_{ij} - \tilde l_{ij}) \tag{14}$$

which depends only on the potentials of the cliques to which the updated line
element belongs, and on the current state of the depth sites adjacent to it. If this
approximation is to remain valid, the equilibrium condition on $f$ must be maintained.
This is done by performing $M$ global deterministic iterations using (10) after every
global stochastic update of the line process. We have found experimentally that the
use of the approximate expression (14) and only three restoring iterations ($M = 3$)
are sufficient to get good convergence behavior.
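
The stochastic update of the line field with the approximation (14) can be sketched as follows. This is an illustration only: the array layout (`lh[r, c]` between sites `(r, c)` and `(r, c + 1)`, `lv[r, c]` between `(r, c)` and `(r + 1, c)`) is assumed, and `constant_line_cost` is a hypothetical stand-in for the clique term $\Delta V_{ij}$ of equation (2), charging a fixed cost per active line rather than evaluating the "a" and "b" cliques.

```python
import numpy as np

def metropolis_line_sweep(f, lh, lv, T, delta_v, rng):
    """One Metropolis sweep over all line elements (in place); returns the activity.

    delta_v(lines, r, c, new) must return the change in line potential when
    lines[r, c] is switched to `new`.
    """
    changed = 0
    for lines, (dr, dc) in ((lh, (0, 1)), (lv, (1, 0))):
        for r, c in np.ndindex(lines.shape):
            old = lines[r, c]
            new = 1 - old
            grad2 = (f[r, c] - f[r + dr, c + dc]) ** 2
            dU = delta_v(lines, r, c, new) + grad2 * (old - new)  # equation (14)
            if dU <= 0 or rng.random() < np.exp(-dU / T):
                lines[r, c] = new
                changed += 1
    return changed / (lh.size + lv.size)

def constant_line_cost(cost):
    """Hypothetical stand-in for the clique potentials: fixed cost per active line."""
    def delta_v(lines, r, c, new):
        return cost * (new - lines[r, c])
    return delta_v
```

In the full procedure each such sweep would be followed by $M$ deterministic iterations of (10) on the depth field, so that $f$ stays close to equilibrium for the new line configuration.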

It is also possible to use assumption (12) and the fixed point condition of the
system (10) to compute a more precise approximation to $\Delta U^*$ (the corresponding
formulae are derived in appendix 5.A). Our experiments indicate, however, that
the simpler approximation (14) gives sufficiently good results, so that the increased
complexity incurred by the use of this more precise scheme does not seem to be
justified.

An important issue in the practical implementation of this procedure is the
determination of the optimal temperature for observing the equilibrium behavior
of the system. We have found that this can be done effectively in an adaptive way
by starting the simulation at a relatively large temperature (say, $T = 5$) and slowly
decreasing it until the network shows an adequate level of activity (measured by the
fraction of sites whose state is modified in one global iteration). We have found that
a level on the order of 0.1 is adequate in most cases. This technique is similar to the
Simulated Annealing method for finding the global minimizer of the energy, but
in that case, the cooling of the system must proceed at a slower rate, and it should
be continued until the level of activity is reduced practically to 0; if we proceed in
this way, the final state of the system will correspond (approximately) to the MAP
estimate. Note that $(\hat f_{MAP}, \hat l_{MAP}) \in F^*$ too, so that the mixed strategy described
above will also work in this case (see Marroquin, 1984). As we pointed out in the last
chapter, if the signal to noise ratio is not too low, the configuration corresponding
to the MAP estimate will be very similar to the optimal one $(\hat f_{PM}, \hat l_{MPM})$. From
a computational viewpoint, however, the optimal estimator is preferable, since it
exhibits a faster and more consistent convergence behavior.
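
The adaptive choice of temperature can be sketched as a small control loop. The geometric cooling factor and the function name below are assumptions of this example; the text only states that the temperature starts near $T = 5$ and is lowered slowly until the activity is on the order of 0.1.

```python
def find_operating_temperature(sweep, T0=5.0, cooling=0.95,
                               target_activity=0.1, max_sweeps=1000):
    """Lower T until the measured activity drops to the target level.

    `sweep(T)` must run one global iteration at temperature T and return the
    fraction of line sites whose state changed during that iteration.
    """
    T = T0
    for _ in range(max_sweeps):
        if sweep(T) <= target_activity:
            break
        T *= cooling        # assumed schedule: slow geometric decrease
    return T
```

Once this temperature is reached, it is held fixed while the statistics needed for the estimators are collected.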

5.  Experimental  Results. 

We  will  now  present  some  experimental  results  that  illustrate  the  performance  of 
the  optimal  Bayesian  estimators  for  surface  reconstruction  tasks.  In  these  examples, 
we  assume  that  we  have  the  following  prior  knowledge  about  the  nature  of  the 
surfaces  we  are  trying  to  reconstruct: 

(i)  The  region  under  consideration  can  be  segmented  into  a  small  number  of 
subregions. 

(ii)  Within  each  subregion  the  surface  is  smooth  (the  gradient  is  less  than  0.5). 

(iii)  The  boundaries  between  regions  are  piecewise  smooth.  There  are  relatively 
few  corners. 

(iv) The average height of the discontinuities across boundaries is greater than 0.8.

(v)  The  observations  are  corrupted  by  an  additive  white  Gaussian  noise  process, 
and  we  have  some  estimate  of  its  intensity. 

This  knowledge  is  embodied  in  the  model  for  the  line  process,  and  in  the 
numerical  value  of  the  parameters.  For  our  experiments,  we  have  used  a  binary 
process  with  potentials  given  by  equation  (2). 

In the first set of experiments, we generated sparse observation points at 200
random locations of a 30 × 30 rectangular grid. Figures 16, 17, 18 and 19 show
(with height coded by grey level) the observations (a); the configuration obtained
by interpolation with no boundaries (b); the final reconstructed surface (c), and the
boundaries found by the algorithm (d), for:

(i) A square at height 2.0 over a background at constant height = 1.0 (Fig. 16).

(ii) A triangle, with the same characteristics (Fig. 17).

(iii) A tilted square plane (slope = 0.1) over a constant height background
with white Gaussian added noise ($\sigma = 0.1$) (Fig. 18).

(iv) Three rectangles at different (constant) heights over a uniform background (Fig. 19).


Figure 17. (a) Observations of a triangle at height 10 over a background at height 1.0 (a
white pixel means that the observation is absent at that point). (b) Interpolation with no boundaries.
(c) Reconstructed surface. (d) Boundaries found by the algorithm.

Figure 18. (a) Observations of a tilted square (slope = 0.1) over a background at height 1.0
with added white Gaussian noise ($\sigma = 0.1$) (a white pixel means that the observation is absent
at that point). (b) Interpolation with no boundaries. (c) Boundaries found by the algorithm. (d)
Reconstructed surface.

Figure 19. (a) Observations of 3 rectangles at heights 2.0, 2.0 and 3.0 over a background at
height 1.0 (a white pixel means that the observation is absent at that point). (b) Interpolation with
no boundaries. (c) Reconstructed surface. (d) Boundaries found by the algorithm.

Figure 20. (a) Observations of a square at height 10 over a background at height 1.0 with
added white Gaussian noise ($\sigma = 0.1$). White pixels denote missing observations. (b) Interpolation
with no boundaries. (c) Boundaries found by the algorithm. (d) Reconstructed surface. (e)
Perspective view of (b). (f) Perspective view of (d).

Figure 21. (a) Disparity data for a stereo pair of aerial photographs (data kindly provided by
W.E.L. Grimson). (b) Interpolation with no boundaries. (c) Boundaries found by the algorithm.
(d) Reconstructed surface.

In many interesting cases, the observation sites are not randomly distributed,
but rather tend to be clustered along certain curves. This is the case, for example, of
the reconstruction of geological structures from seismic data, or of certain algorithms
for the reconstruction of surfaces from stereoscopic pairs of images, when the stereo
matching is done only at the "edges" (places where the intensity gradient is large)
detected in the images. The synthetic example of figure 20 illustrates this situation
(here we include also a perspective representation of the reconstructed surfaces, so
that the difference between the smooth reconstruction and the optimal estimate can
be fully appreciated). In figure 21 we illustrate a real example of this situation. It
represents the interpolation of data obtained along the zero-crossing contours of
the convolution of a stereo pair of aerial photographs (depicting the campus of
the University of British Columbia) with a "Difference of Gaussians" operator, by
Grimson's implementation of the Marr-Poggio stereo algorithm [G4, M2]. We will
come back to this example when we discuss the stereo matching problem in detail
in the next chapter.

We have also used a modified Simulated Annealing scheme to get the MAP
estimator for the same examples presented above (see Marroquin, 1984). The final
configurations are very similar to the optimal ones, so we do not reproduce them
here. With respect to computational efficiency, it took, on average, around
450 global iterations (in a global iteration the state of the complete line field is
updated, and the equilibrium of the depth field is restored) for the Simulated
Annealing algorithm to converge, while for the $(\hat f_{PM}, \hat l_{MPM})$ estimator, only 250
were needed. Also, in the latter, the behavior of the algorithm was more consistent,
in the sense that the differences in the results from successive runs with the same
data were smaller than in the former case.

6.  A  Fast  Algorithm. 

The  ergodicity  of  the  "Gibbs  chain"  (the  Markov  chain  generated  by  the 
Gibbs  Sampler  or  the  Metropolis  algorithm  at  a  fixed  temperature)  means  that  its 
time  behavior  mirrors  the  ensemble  probabilistic  structure.  Since  the  probability  of 
turning  "on"  a  given  line  element  depends  on  the  difference  in  the  values  of  the 
associated  depth  elements  (i.e.,  on  the  current  value  of  the  gradient  of  the  field  /  at 
that  location),  the  configurations  with  active  lines  at  points  of  high  gradient  will  be 
generated  first.  These  lines,  in  turn,  will  decouple  the  adjacent  depth  sites,  increasing 
the  gradient  even  more,  thus  generating  a  positive  feedback  that  stabilizes  these
configurations  (the  opposite  happens  in  regions  of  low  gradient,  which  prevents  the 
formation  of  stable  clusters  of  lines  at  those  points). 

We  can  see,  therefore,  that  the  behavior  of  the  Gibbs  chain  can  be  thought 
of,  qualitatively,  as  performing  in  time  a  scale  separation  of  the  discontinuities 
of  the  image.  This  suggests  the  use  of  a  deterministic  scheme  that  performs  the 
same  separation,  but  compressing  the  time  of  the  Gibbs  chain.  A  simple  way  of 
implementing  this  idea,  is  to  introduce  a  time  varying  coupling  between  the  depth 
and line fields, and to allow only "downhill" moves (i.e., those with negative $\Delta U^*$)
in the updating rules for the line process. Specifically, we compute the increment
in energy associated with the update of the line element $l_{ij}$ at time $t$ using:

$$\Delta U^* = \Delta V_{ij} + K(t) (f_i - f_j)^2 (l_{ij} - \tilde l_{ij}) \tag{15}$$

instead of equation (14), where $\tilde l_{ij} = 1 - l_{ij}$ is the candidate state, and accept the
candidate state only if $\Delta U^* < 0$. The coupling strength $K(t)$ is computed using:

$$K(t) = K_0 + ht$$

(where $K_0$ and $h$ are positive constants) until it reaches a given value $K_T$, and it is
held constant at this value thereafter. The state of the depth process is updated, as
before, using equation (10). $K_0$ must be chosen in such a way that with $f = f^*_0$ and
$l_i = 0$ for all $i$, no lines will be turned "on" in the first iteration. This means that
if we use equation (2) (with $\beta = 1$, and the values of $\phi$ given in the corresponding
tables) to compute the potentials, we must have:

$$K_0 \le \frac{0.25}{a} \tag{16}$$

where

$$a = \sup_{i,j} (f_i - f_j)^2$$

On the other hand, the final value of $K(t)$ (i.e., $K_T$) must be such that no lines are
introduced in the smooth regions. Let

$$b = \inf_{(i,j) \in D} (f_i - f_j)^2, \qquad c = \sup_{(i,j) \in S_m} (f_i - f_j)^2$$

where $D$ is the set of neighboring pairs of sites such that each site belongs to a
different smooth patch (i.e., pairs that lie across a discontinuity), and $S_m$ is the
complementary set of pairs of adjacent sites such that both sites belong to the same
continuous patch. $K_T$ must satisfy:

$$\frac{0.25}{b} < K_T < \frac{0.25}{c}$$


Note that even if we do not know the precise values of $a$, $b$ and $c$ for a given
problem, usually we can estimate them accurately enough to determine "safe" values
for $K_0$ and $K_T$. The value of $h$ controls the number of iterations needed for the
algorithm to reach a fixed point; if $h$ is too large and the observations are relatively
sparse, we might get suboptimal solutions where regions with no observations are
completely surrounded by lines and therefore adopt spurious constant values. We
have found experimentally that usually 50 iterations, i.e., setting

$$h = \frac{K_T - K_0}{50}$$

are enough to produce results that are indistinguishable from those produced by
the Monte Carlo approximation.
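
The deterministic scheme can be illustrated on a one-dimensional chain, where a single line element sits between each pair of consecutive sites. The following Python sketch is a simplification made for this example only: it replaces the clique term $\Delta V_{ij}$ by a fixed cost per active line (`line_cost`, a hypothetical stand-in chosen to match the value 0.25 used in the bounds above), and all function and parameter names are assumptions.

```python
import numpy as np

def relax(f, l, g, q, alpha, n_iter):
    """n_iter synchronous iterations of the update rule (10) on a 1-D chain."""
    t = 1.0 - l                                  # t[i] couples sites i and i+1
    for _ in range(n_iter):
        num = (alpha * q * g).astype(float)
        den = (alpha * q).astype(float)
        num[:-1] += t * f[1:]
        den[:-1] += t
        num[1:] += t * f[:-1]
        den[1:] += t
        f = np.where(den > 0, num / np.where(den > 0, den, 1.0), f)
    return f

def fast_reconstruct(g, q, alpha, k0, kT, n_ramp=50, n_total=80,
                     m_restore=3, line_cost=0.25):
    """Deterministic line/depth updates with a linearly increasing coupling K(t)."""
    l = np.zeros(g.size - 1)
    # Start from (an approximation of) the smooth solution with no lines.
    f = relax(np.full(g.size, g[q > 0].mean()), l, g, q, alpha, 500)
    h = (kT - k0) / n_ramp
    for step in range(n_total):
        k = min(k0 + h * step, kT)               # K(t), held at K_T after the ramp
        for i in range(g.size - 1):
            new = 1.0 - l[i]
            dU = (line_cost * (new - l[i])
                  + k * (f[i] - f[i + 1]) ** 2 * (l[i] - new))   # equation (15)
            if dU < 0:                           # accept downhill moves only
                l[i] = new
        f = relax(f, l, g, q, alpha, m_restore)  # restore depth equilibrium
    return f, l
```

On a noise-free unit step observed at every site, this sketch switches on only the line element located at the step, after which the depth field relaxes back to the two constant levels.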

This scheme has an additional advantage: the optimal value of the coupling
between the depth and line fields (the constant $\beta$ in equation (2)) depends on the
height of the discontinuities relative to the gradient in the smooth patches. It is,
therefore, a free parameter of the Monte Carlo algorithm that must be adapted to
each particular problem. Since in the deterministic scheme it is varied dynamically,
its adaptation to each problem is automatic, provided that we choose $K_T$ and $K_0$
sufficiently large and small, respectively, so that the procedure has practically no
free parameters.

Figure 22. (a) Coloring of the coupled line-depth lattice. (b) and (c) Elements whose state
is stored in each of the two types of processors of a 4-connected parallel architecture.


7.  Parallel  Implementations. 


Both the general Monte Carlo procedure of section 5 and the deterministic
algorithm of the last section can be efficiently implemented in a parallel architecture.
To study this implementation, we first note that the chromatic numbers (see section
6.2 of chapter 3) of the graphs associated with the line and depth neighborhood
systems are 4 and 2, respectively, which means that the coupled process has a
chromatic number of 6. In figure 22 (a) we illustrate one possible "coloring".

The colors of the line process are represented by the numbers 1, 2, 3, 4, and
those of the depth process by white and black circles. The updating process can
be implemented in a 4-connected architecture such as the "Connection Machine",
by assigning one processor to each depth site and its four adjacent line elements.
We will thus have two different populations of processors, whose configurations are
shown in figures 22 (b) and (c), respectively.

Each  complete  iteration  consists  of  6  major  cycles:  in  the  first  two,  the  state  of
the  white  and  black  depth  variables  is  respectively  updated,  and  in  the  next  four, 
the  new  states  of  the  binary  line  variables  stored  in  (say)  the  white  processors  are 
successively  computed  and  transmitted  to  the  corresponding  memory  locations  of  the 
neighboring  black  processors.  Note  that  in  this  scheme  we  have  some  redundancy 
in  the  use  of  memory  (each  binary  variable  is  stored  twice),  but  the  state  of  all 
the  elements  needed  for  each  updating  operation  is  always  available  from  adjacent 
processors. 
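
The two-colour update of the depth sites can be sketched as follows. This is only an illustration of the white/black decomposition on a serial machine; the function names are hypothetical, and `update` stands for any full-field synchronous rule, such as one application of (10).

```python
import numpy as np

def checkerboard_masks(shape):
    """Boolean masks of the "white" and "black" depth sites of a 4-connected lattice."""
    r, c = np.indices(shape)
    white = (r + c) % 2 == 0
    return white, ~white

def colored_sweep(f, update, masks):
    """Apply `update` one colour at a time, committing only the sites of that colour."""
    for mask in masks:
        f = np.where(mask, update(f), f)
    return f
```

Since no two sites of the same colour are neighbors, all sites of one colour can be updated simultaneously without reading a value that is being modified in the same cycle.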

7.1.  Connection  Machine  Execution  Time. 

The  update  of  each  depth  site  requires  2  (16-bit)  multiplications;  5  additions 
and  10  1-bit  comparisons,  that  is,  about  600  cycles  of  a  1-bit  processor.  The 
computation  of  the  increment  in  energy  for  the  line  process  (equation  14)  requires 
1  multiplication;  5  additions  and  13  1-bit  operations,  that  is  350  cycles.  For  the 
deterministic  algorithm,  we  require  256  additional  cycles  for  the  multiplication 
by  the  variable  coupling  constant,  while  the  exponentiation  and  random  number 
generation  needed  for  the  Monte  Carlo  updating  use  about  2300  additional  cycles 
(we  assume  that  the  updating  of  the  coupling  constant  is  done  once  every  complete 
iteration  in  the  host  computer,  and  the  new  value  broadcast  to  the  whole  network). 

Considering  that  the  Monte  Carlo  algorithm  requires  about  200  iterations  to 
converge,  while  only  50  are  needed  in  the  deterministic  case,  we  get  the  following 
approximate  estimates  for  the  total  execution  time  in  the  "Connection  Machine" 
(using  the  same  assumptions  as  in  section  6.3  of  chapter  3):  2.4  seconds  for  the 
Monte  Carlo  procedure,  and  0.18  seconds  for  the  deterministic  algorithm. 
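
The two totals can be reproduced with a short calculation. The sketch below assumes a cycle time of 1 microsecond, an iteration made of 2 depth cycles plus 4 line cycles, and no communication cost; these are illustrative assumptions standing in for those of section 6.3 of chapter 3, which are not restated here.

```python
# Cycle counts quoted in the text (cycles of a 1-bit processor).
DEPTH_UPDATE = 600   # one depth-site update
LINE_ENERGY = 350    # energy increment for one line element
DET_EXTRA = 256      # multiplication by the variable coupling constant
MC_EXTRA = 2300      # exponentiation and random number generation

# Assumption (not stated in this section): 1 microsecond per cycle.
CYCLE_TIME_S = 1e-6


def total_time(extra_per_line_cycle, iterations):
    """Execution time in seconds for a given number of complete iterations."""
    # Assumed schedule: 2 depth cycles (white, black) and 4 line cycles.
    cycles = 2 * DEPTH_UPDATE + 4 * (LINE_ENERGY + extra_per_line_cycle)
    return cycles * iterations * CYCLE_TIME_S


monte_carlo_s = total_time(MC_EXTRA, 200)     # about 2.4 seconds
deterministic_s = total_time(DET_EXTRA, 50)   # about 0.18 seconds
```

Under these assumptions the Monte Carlo procedure costs 11800 cycles per iteration and the deterministic algorithm 3624, which is consistent with the estimates quoted above.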


7.2.  Analog  Networks. 


In chapter 4 we discussed the use of the "neural" networks introduced by
Hopfield (1984) (see also Hopfield and Tank, 1985) for constructing analog systems
that approximate the optimal estimators of binary fields. Since for a binary system
the TPM and MPM estimates are equivalent (see chapter 3), we can, in principle,
replace the digital computation of the l field in the hybrid scheme discussed above
(see figure 15) by a "neural" network that approximates the optimal estimate, coupled
with the analog "f" network (note that the switches must be replaced by analog
devices that implement a multiplication). The time constant of the "neural" network
has to be adjusted so that the "f" network remains in equilibrium and the search
space is effectively restricted to the set F* (see section 4).

To implement this idea, we must define a new energy function that depends
continuously on $l$, and whose behavior is similar to $U_P$ for $l_i \in \{0, 1\}$ (Hopfield,
1985). One such function is:

E(f,  0  =  *£  E  (/.-  -  /,)2(»  -  ‘a)  +  <*k  £(/.  -  «)*  + 

«  iciV,  t'6S 

+C1  £  £  J*'(  £  —  I)*  +  c2  ]C  *»(*  —  *»')  + 

«  Ca:i£Cu  • 

+<*  £  £  Uij  (23) 

C»:»€Ct  /£(?*—{«} 

where $K, \alpha, c_1, c_2, c_3$ are constants.

Following  the  construction  discussed  in  section  5  of  chapter  4,  we  can  now  use 
an  analog  network  that  implements  the  dynamical  system: 

$$\frac{du_i}{dt} = -\frac{\partial E}{\partial l_i} - u_i$$

$$l_i = \theta(u_i)$$

where the function $\theta$, which corresponds to the gain of the non-linear amplifiers
that are at the nodes of the network, is as defined in equation (15) of chapter 4 (note
that in this case the network also contains non-linear elements that act as analog
multipliers).


We  have  performed  numerical  simulations  of  this  method,  and  the  results  are 
similar  to  the  optimal  ones  if  the  parameters  of  the  system  are  selected  appropriately. 
The  system  can  be  made  practically  data-independent  by  making  the  coupling  K 
between  the  two  networks  (see  equation  (23))  time-varying,  in  the  manner  that 
was  described  in  section  6.  We  have  found  that  a  reasonable  set  of  values  for  the 
remaining parameters is: $c_1 = 0.15$; $c_2 = 0.05$; $c_3 = 1.5$.


8.  Discussion. 

In  this  chapter  we  have  studied  the  problem  of  reconstructing  piecewise 
continuous  surfaces  from  sparse  and  noisy  data.  We  showed  that  such  surfaces 
can be adequately modeled by two coupled MRFs: a depth field with quadratic
potentials and a binary "line" field with sites in the dual lattice, and with potentials
that  represent  our  prior  knowledge  about  the  geometry  of  the  curves  that  bound 
the  smooth  patches. 

We  pointed  out  that  a  straightforward  extension  of  the  general  estimation 
procedures  derived  in  chapter  3  to  this  problem  is  computationally  unfeasible,  due 
to the continuous nature of the depth field. Therefore, we proposed a modified
computational  strategy  that  is  based  on  the  fact  that  the  search  space  for  the  optimal 
estimates  can  be  restricted  to  those  configurations  in  which  the  depth  field  minimizes 
the  (quadratic)  conditional  posterior  energy  for  each  given  line  configuration.  The 
plausibility  of  this  scheme  was  demonstrated  by  experimental  results  showing  the 
reconstruction  of  both  synthetic  and  "real"  surfaces. 

We  also  derived,  based  on  heuristic  arguments,  a  fast  deterministic  algorithm 
with excellent experimental performance, and whose parameters can be made
problem-independent,  and  discussed  the  implementation  of  all  these  procedures  in 
parallel  digital  machines,  and  in  hybrid  and  analog  networks. 

It  is  interesting  to  compare  the  techniques  we  have  presented  with  other  surface 
reconstruction  methods  that  handle  discontinuities.  The  most  successful  of  these 
(see  Terzopoulos,  1984)  are  based  on  the  idea  of  interpolating  a  smooth  surface 
first and then detecting the discontinuities by a threshold mechanism. We believe
that the method that we are proposing has some advantages over this scheme which
justify its use in spite of the increased computational cost:

(i)  From  a  conceptual  viewpoint,  it  is  better  to  perform  the  interpolation  and 
boundary  detection  tasks  at  the  same  time,  rather  than  approximating 
an  everywhere  smooth  surface  first,  since  this  operation  hides  the 
discontinuities  that  one  then  tries  to  find  in  the  second  phase. 

(ii)  In  our  method,  the  values  of  the  parameters  depend  only  on  the  average 
height  of  the  jumps  that  one  wants  to  consider  as  boundaries  in  the 
reconstructed  surface,  and  thus,  they  are  independent  of  the  location  of  the 
observations.  If  these  are  sparsely  located,  even  when  the  discontinuity  is 
relatively  large,  the  threshold  method  may  fail. 

(iii)  A  priori  knowledge  about  the  shape,  orientation  and  position  of  the 
discontinuities  can  be  easily  incorporated  by  choice  of  the  potentials  of 
the  line  process.  This  fact  makes  our  method  particularly  promising  for 
integrating  information  from  qualitatively  different  sources  into  a  single 
unified  estimation  procedure. 

(iv)  The  same  algorithm  can  be  used  for  surface  interpolation,  noise  elimination 
(smoothing)  and  boundary  detection. 

We  will  now  study  a  related  problem:  the  reconstruction  of  surfaces  from 
stereoscopic  images. 
Appendix 5.A


HIGHER ORDER APPROXIMATION TO $\Delta U^*$


In this appendix we describe a higher order approximation to the energy
increment $\Delta U^*$ (see section 4 of chapter 5). We will compute $\Delta U^*$ using:


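$$\Delta U^* \approx \Delta V_{ij} + \sum_{m=i,j} \Big\{ (\tilde f_m^2 - f_m^2) \Big[ \sum_{k \in N_m} (1 - l_{km}) + \alpha q_m \Big] - 2 (\tilde f_m - f_m) \Big[ \sum_{k \in N_m} f_k (1 - l_{km}) + \alpha q_m g_m \Big] \Big\} \qquad (1)$$

using the assumption:

$$\tilde f_p \approx f_p \quad \text{for } p \neq i, j$$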

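the new equilibrium configuration $\tilde f$ can be estimated by the following formulas,
which correspond to the fixed point of:

$$f_i^{(k+1)} = \frac{\sum_{j \in N_i} f_j^{(k)} (1 - l_{ij}) + \alpha q_i g_i}{\sum_{j \in N_i} (1 - l_{ij}) + \alpha q_i} \qquad (2)$$

when $f_p$, $p \neq i, j$ is held fixed.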
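Let:

$$\bar l_{km} = \begin{cases} 1 - \tilde l_{km}, & \text{for } k, m = i, j \\ 1 - l_{km}, & \text{otherwise} \end{cases}$$

$$\gamma_m = \sum_{k \in N_m} \bar l_{km} + \alpha q_m$$

The new equilibrium configuration will be a fixed point of (10), and therefore, it
will satisfy:

$$\tilde f_m = \frac{1}{\gamma_m} \Big[ \sum_{k \in N_m} \tilde f_k \bar l_{km} + \alpha q_m g_m \Big] \quad \text{for } m = i, j$$

If $\tilde l_{ij} = 0$ and $\gamma_i \gamma_j \neq 1$, we get

$$\tilde f_i = \frac{\gamma_j}{\gamma_i \gamma_j - 1} \Big\{ \sum_{k \in N_i - \{j\}} f_k \bar l_{ki} + \alpha q_i g_i + \frac{1}{\gamma_j} \Big[ \sum_{k \in N_j - \{i\}} f_k \bar l_{kj} + \alpha q_j g_j \Big] \Big\}$$

$$\tilde f_j = \frac{1}{\gamma_j} \Big[ \sum_{k \in N_j} \tilde f_k \bar l_{kj} + \alpha q_j g_j \Big]$$

If $\tilde l_{ij} = 0$ and $\gamma_i \gamma_j = 1$, it means that there are no observations, neither at i nor at
j, and that these two sites are isolated from the rest of the lattice by line elements.
Therefore, we use:

$$\tilde f_i = \tilde f_j = \frac{f_i + f_j}{2}$$

Finally, if $\tilde l_{ij} = 1$, we put

$$\tilde f_m = \begin{cases} \frac{1}{\gamma_m} \Big[ \sum_{k \in N_m} f_k \bar l_{km} + \alpha q_m g_m \Big], & \text{if } \gamma_m \neq 0 \\ f_m, & \text{if } \gamma_m = 0 \end{cases}$$

for m = i, j.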

Besides, if the move from $l$ to $\tilde l$ is accepted by the Metropolis criterion, we replace

$$f_m \leftarrow \tilde f_m \quad \text{for } m = i, j$$


As  described  in  chapter  5,  after  all  l  sites  have  been  updated,  M  restoring 
iterations  using  equation  (10)  of  that  chapter  should  be  applied. 


Chapter  6 


SIGNAL  MATCHING 


1.  Introduction. 

In all the estimation problems we have studied so far, the posterior energy
function had the form:

$$U_P(f; g) = U_0(f) + \sum_i \Phi_i(f, g_i)$$

where $U_0(f)$ corresponded to the MRF model for the field $f$. The functions $\Phi_i$,
whose precise form depended on the particular noise model, were non-decreasing
functions of the distance between $f_i$ and $g_i$ (see equation (2) of chapter 3):

$$\Phi_i(f, g_i) = -\ln P_{n_i}(\Psi^{-1}(g_i, H_i(f)))$$


There are some cases, however, when the conditional probability distribution
of the observations $P_{g|f}(g \mid f)$ is multimodal (as a function of $f$), which causes the
functions $\Phi_i$ to be non-monotonic, so that the solution to the problem remains
ambiguous, even if the observations are dense and the signal to noise ratio arbitrarily
high. To illustrate this situation, we will study an important instance of it: the
"signal matching" problem, whose one-dimensional version is as follows:

Consider two one-dimensional, real valued sequences $h_L, h_R$, where $h_L$ is
obtained from $h_R$ by shifting some subintervals according to the "disparity sequence"
$d$:

$$h_L(i) = h_R(i + d_i)$$

with

$$d_i \in Q = \{-m, -m+1, \ldots, -1, 0, 1, \ldots, m\}$$

The signal matching problem is to find $d$ given $h_L, h_R$. (In a more realistic
situation, we do not observe $h_L, h_R$ directly, but rather some noise-corrupted
versions $g_L, g_R$.) Some interesting instances of this problem are the matching of
stereoscopic images along epipolar lines (Marr and Poggio, 1976); the computation
of the dip angle of geological structures from electrical resistivity measurements
taken along a bore hole; and the matching of DNA sequences.

To make the discussion more specific, we will consider a simple example, in
which the sequences $h_L, h_R$ are binary Bernoulli sequences; we will assume that the
noise corruption process can be modeled as a binary symmetric channel with known
error rate, and that $d$ is known to be a piecewise constant function. A well known
instance of this problem is the matching of a row of a random dot stereogram with
density $p$ (Julesz (1960)), when the components of the stereo pair are corrupted by
noise.

The stochastic model for the observations is then constructed by assuming that
the right image is a sample function of a Bernoulli process $A$ with parameter $p$:

$$g_R(i) = A(i)$$


The left image is assumed to be formed from the right one by shifting it by a
variable amount given by the disparity function $d$, except at some points where an
error is committed with probability $\epsilon$. Note that some regions that appear in the right
image will be occluded in the left one (see figure 23). The "occlusion indicator" $\phi_d$
can be computed deterministically from $d$ in the following way:

$$\phi_d(i) = \begin{cases} 1, & \text{if } d_{i-k} > d_i + k \text{ for some integer } k \in (0, m] \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The occluded areas are assumed to be "filled in" by an independent Bernoulli
process $B$. The final model is then:

$$g_L(i) = \begin{cases} g_R(i + d_i), & \text{with prob. } 1 - \epsilon, \text{ if } \phi_d(i) = 0 \\ 1 - g_R(i + d_i), & \text{with prob. } \epsilon, \text{ if } \phi_d(i) = 0 \\ B(i), & \text{with prob. } 1, \text{ if } \phi_d(i) = 1 \end{cases} \qquad (3)$$


Figure 23. Occluded Regions: The horizontal and vertical axes represent points in one row
of the left and right images, respectively. Matching points are represented by black circles. Any
match in the shaded region will occlude the point i.

Note that in the two-dimensional case, the index $i$ denotes a site of a lattice, and
therefore it can be represented as a two-vector $(i_1, i_2)$ whose components denote
the column and row of the site, respectively. To simplify the notation, we will adopt
the following convention throughout this chapter: when a scalar is added to this
vector index (as in $g_R(i + d_i)$ and $d_{i+k}$), it will be implicitly assumed that it is
multiplied by the vector (1,0) (so that the above expressions should be understood
as $g_R(i + (d_i, 0))$ and $d_{i+(k,0)}$, respectively). Using this convention, the observation
model of equation (3) can be applied either to the one or to the two-dimensional
cases.
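
As an illustration, the one-dimensional observation model of equations (2) and (3) can be simulated as follows. This is a minimal sketch: the function names, the padding of the right sequence by m samples on each side (so that i + d_i is always a valid index), and the use of the same density p for the fill-in process B are assumptions made for the example.

```python
import random


def occlusion_indicator(d, m):
    """phi_d(i) = 1 if d[i-k] > d[i] + k for some integer k in (0, m], else 0."""
    n = len(d)
    phi = [0] * n
    for i in range(n):
        for k in range(1, m + 1):
            if i - k >= 0 and d[i - k] > d[i] + k:
                phi[i] = 1
                break
    return phi


def make_pair(d, m, p=0.5, eps=0.05, seed=0):
    """Draw (g_L, g_R, phi) for a disparity sequence d with |d_i| <= m."""
    rng = random.Random(seed)
    n = len(d)
    # Assumed storage convention: g_r[j + m] holds g_R(j).
    g_r = [int(rng.random() < p) for _ in range(n + 2 * m)]
    phi = occlusion_indicator(d, m)
    g_l = []
    for i in range(n):
        if phi[i] == 1:
            g_l.append(int(rng.random() < p))   # occluded: filled in by B
        else:
            value = g_r[i + d[i] + m]
            if rng.random() < eps:              # error with probability eps
                value = 1 - value
            g_l.append(value)
    return g_l, g_r, phi
```

With eps = 0 and no occlusions, every sample of the left sequence is an exact copy of the correspondingly shifted sample of the right one.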


Notice that even if the observations are noise-free ($\epsilon = 0$) the solution of the
problem remains ambiguous, and it cannot be uniquely determined unless some
prior knowledge about $d$ (for example, in the form of a MRF model) is introduced.
The use of a MRF model in the stereo matching case corresponds to a quantification
of the assumption of the existence of "dense solutions" (this term was introduced
by Julesz (1960), and essentially corresponds to the assumption that the disparity $d$
varies smoothly in most parts of the image; see also Marr and Poggio (1979)), and
the use of the occlusion indicator corresponds to the "ordering constraint" (i.e., the
requirement that if $i > j$, then $i + d_i > j + d_j$; see Baker (1981); we put $\phi_d = 1$
whenever this constraint is violated).


2. Bayesian Formulation.

To formulate the estimation problem, we will consider the sequence $g_L$ as
"observations", while $g_R$ will play the role of a set of parameters. Thus, from (3),
we have (assuming, for simplicity, that $p = 1/2$):

$$P(g_L(i) = k \mid d, g_R) = P_{g|d}(k) = \begin{cases} 1 - \epsilon, & \text{if } \phi_d(i) = 0 \text{ and } g_R(i + d_i) = k \\ \epsilon, & \text{if } \phi_d(i) = 0 \text{ and } g_R(i + d_i) \neq k \\ \frac{1}{2}, & \text{if } \phi_d(i) = 1 \end{cases}$$

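The posterior distribution $P_{d|g}$ will then be:

$$P_{d|g} = \frac{P_d \cdot P_{g|d}}{P_g} = \frac{1}{Z} \exp\Big[ -\frac{1}{T_0} \sum_{i,j} V(d_i, d_j) \Big] \cdot \prod_i \Big\{ \big[ (1 - \epsilon)\, \delta(g_L(i) - g_R(i + d_i)) + \epsilon\, (1 - \delta(g_L(i) - g_R(i + d_i))) \big] (1 - \phi_d(i)) + \tfrac{1}{2} \phi_d(i) \Big\}$$

where

$$\delta(x) = \begin{cases} 1, & \text{if } x = 0 \\ 0, & \text{otherwise} \end{cases}$$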


As a prior model for the disparity field, we may use a first order MRF with
generalized Ising potentials, such as the one presented in chapter 4. Other models
may also be used, including the coupled depth and line fields that we discussed in
the previous chapters. For the present, let us assume that the simpler Ising model is
adequate. Note that even when the matching problem is one-dimensional (we are
assuming that there is no vertical disparity between the images, so that the matching
can be done on a row-by-row basis), the two-dimensional nature of the prior MRF
model for the disparity introduces a coupling between matches at adjacent rows.
The posterior energy is:

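$$U_P(d; g) = \frac{1}{T_0} \sum_{i,j} V(d_i, d_j) - \sum_i \ln \Big\{ \big[ (1 - \epsilon)\, \delta(g_L(i) - g_R(i + d_i)) + \epsilon\, (1 - \delta(g_L(i) - g_R(i + d_i))) \big] (1 - \phi_d(i)) + \tfrac{1}{2} \phi_d(i) \Big\}$$

Using the fact that for any $a, b > 0$

$$\ln[a\, \delta(x) + b\, (1 - \delta(x))] = \delta(x) \ln a + (1 - \delta(x)) \ln b$$

we can write an equivalent expression for $U_P$ (modulo an additive constant):

$$U_P(d; g) = \frac{1}{T_0} \sum_{i,j} V(d_i, d_j) + \sum_i \phi_d(i) \ln 2 + \alpha \sum_i (1 - \phi_d(i)) \big( 1 - \delta(g_L(i) - g_R(i + d_i)) \big)$$

where

$$\alpha = \ln\Big( \frac{1 - \epsilon}{\epsilon} \Big)$$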


3.  Optimal  Estimator. 


It is possible to apply the general Monte Carlo algorithms developed in chapter
3 to approximate the optimal estimate $d$ with respect to a given performance measure
(such as the mean squared error). Their use in this case, however, is complicated by
the introduction of the occlusion function $\phi_d$ in the posterior energy: the size of the
support for this function equals the total number of allowed values for the disparity
(see equation (2)). If this number is large, the computation of the increment in
energy, or of the conditional distributions (if the Gibbs Sampler is used) may be
quite expensive. In many cases, however, the size of the regions of constant disparity
is relatively large compared with the size of the occluded areas. In these cases, one
can approximate the posterior energy by:

$$U_P(d) = \frac{1}{T_0} \sum_{i,j} V(d_i, d_j) + \alpha \sum_i \big( 1 - \delta(g_L(i) - g_R(i + d_i)) \big) \qquad (5)$$

and  increase  significantly  the  computational  efficiency  (we  have  successfully  used 
this  approach  to  reconstruct  the  disparity  of  random  dot  stereograms). 

In  the  one-dimensional  case,  it  is  also  possible  to  extend  the  dynamic 
programming  methods  described  in  appendix  4.B  to  compute  the  MAP  estimate; 
this  extension  is  described  in  appendix  6.A. 
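
To make the one-dimensional case concrete, the sketch below minimizes an energy of the form (5) along a single row with a standard Viterbi-style recursion. It is not the procedure of appendix 6.A, which is not reproduced here. The Ising-type pair potential (-1 for equal neighbouring disparities, +1 otherwise), the padding convention for the right sequence, and all names are assumptions made for the example.

```python
import math


def map_disparity_1d(g_l, g_r, m, eps=0.05, t0=1.0):
    """Minimize (1/T0) sum_i V(d_i, d_{i+1}) + alpha sum_i [g_L(i) != g_R(i + d_i)]."""
    alpha = math.log((1.0 - eps) / eps)
    labels = list(range(-m, m + 1))
    n = len(g_l)

    def data_cost(i, d):
        # Assumed storage convention: g_r[j + m] holds g_R(j).
        return alpha * (g_l[i] != g_r[i + d + m])

    def pair_cost(a, b):
        # Assumed Ising-type potential between neighbouring sites.
        return (-1.0 if a == b else 1.0) / t0

    cost = [data_cost(0, d) for d in labels]
    back = []
    for i in range(1, n):
        new_cost, pointers = [], []
        for d in labels:
            best = min(range(len(labels)),
                       key=lambda k: cost[k] + pair_cost(labels[k], d))
            new_cost.append(cost[best] + pair_cost(labels[best], d) + data_cost(i, d))
            pointers.append(best)
        cost = new_cost
        back.append(pointers)

    # Trace the optimal path back from the cheapest final label.
    k = min(range(len(labels)), key=lambda j: cost[j])
    path = [labels[k]]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(labels[k])
    path.reverse()
    return path


g_r = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]   # padded right sequence, m = 1
g_l = g_r[1:9]                          # left sequence with zero disparity
d_map = map_disparity_1d(g_l, g_r, m=1)
```

The cost of the recursion grows linearly with the length of the row and quadratically with the number of allowed disparities.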

An  alternative  approach  to  the  solution  of  this  problem  is  to  implement  the 
local  constraints,  generated  by  the  prior  MRF  model,  directly  in  a  deterministic 
"cooperative  network"  of  a  given  form  (a  "Winner-Takes-All"  network)  whose 
fixed  point  will  correspond  to  the  optimal  solution.  This  will  be  done  in  section 
6.  First  we  present,  in  section  4  the  definition  of  a  "Cooperative  Algorithm",  and 
describe  and  analyze,  in  section  5,  the  previous  work  that  has  been  done  in  this 
connection. 

4.  Cooperative  Algorithms. 

Consider the two-dimensional signal matching problem defined in section 2,
and suppose that to each site $i$ of the lattice $\Omega$ we associate a set of binary variables
$\{f_{i,d},\ d \in Q\}$ (we will call this set the "ith column" of the network $f$, and the set
$\{f_{i,d},\ i \in \Omega\}$, the "disparity layer $d$" of the same network).

If a particular variable $f_{i,d} = 1$, it means that we assign to site $i$ the disparity
$d$ (note that more than one disparity may be assigned to a node at a given time).

A "Cooperative Algorithm" (Marr and Poggio, 1976; it is also known as a
"cellular automaton"; see Wolfram, 1983) is a rule for updating the state of the
network $f$. It can be represented formally as:

$$f_{i,d}(t + 1) = F_{i,d}(f(t), t)$$

with the additional requirement that the interactions should be local, that is:

$$F_{i,d}(f(t), t) = F_{i,d}(\{f_{j,d'}(t) : j \in N_i,\ d' \in Q\},\ t)$$

where $N_i$ is the (two-dimensional) neighborhood of site $i \in \Omega$. The idea is to define
the functions $F$ (i.e., the connections of the cooperative network) in such a way that
the following local constraints are implemented:

(i) Compatibility with the observations: Each element $f_{i,r}$ should receive
an "excitatory" external input proportional to the conditional probability
$\Pr(g_L(i) = g_R(i + r) \mid d_i = r)$.

(ii) Smoothness: This corresponds to an implementation of the MRF prior
model for the disparity: the likelihood that an element $f_{i,d}$ is turned "on"
(i.e., is set equal to 1) should increase if the elements $\{f_{j,d},\ j \in N_i\}$ are
"on" ($N_i$ is the neighborhood of $i$ in the Markov model), so that excitatory
connections should exist between these elements.

(iii) Uniqueness: Since in the final configuration $f^*$ one and only one element
of each column $\{f^*_{i,d},\ d \in Q\}$ should be equal to 1, each element should
have "inhibitory" connections with the other elements of the same column.

The operation of the network will be Synchronous if all its elements are updated
in parallel at the same time, and Asynchronous if they are updated sequentially,
one at a time. Note that one synchronous iteration is equivalent to $|f|$ (the number
of elements of the network $f$) asynchronous ones (we will refer to $|f|$ successive
iterations as a Global Iteration), and that the evolution of the asynchronous network
will depend, in general, on the order in which its elements are updated.

5.  "Linear  Threshold"  Networks. 

The first successful application of this approach (although not formulated in
probabilistic terms) is the algorithm developed by Marr and Poggio (1976) for the
stereo disparity computation. They proposed a binary network of the form:

$$f_i(t + 1) = \sigma(p_i)$$

with

$$p_i = \sum_j f_j(t)\, w_{ij} + \rho_i - \theta, \qquad i, j \in \Omega \times Q; \qquad (6)$$

$$\sigma(p) = \begin{cases} 1, & \text{if } p > 0 \\ 0, & \text{otherwise} \end{cases}$$

$w_{ij}$ satisfying $w_{ij} = w_{ji}$, for all $i, j \in \Omega \times Q$

and $f_i \in \{0, 1\}$, for all $i$


The parameters $w_{ij}$, $\rho_i$ and $\theta$ must be chosen in such a way that the constraints
to the solution of our problem are implemented locally. In particular, the smoothness
constraint is implemented by defining:

$$w_{x,d,y,d} = 1, \quad \text{for } y \in N_x;\ x, y \in \Omega$$

where $N_x$ is an excitatory neighbourhood of $x$. The uniqueness constraint, by:

$$w_{x,d,y,d'} = -\epsilon, \quad \text{for } (y, d') \in M_{x,d}$$

with $M_{x,d}$ an inhibitory neighbourhood corresponding to multiple matches at $x$ (see
Marr and Poggio (1976) for a precise definition of these neighbourhoods), and

$$w_{x,d,y,d'} = 0 \quad \text{elsewhere.}$$

The compatibility with the observations is enforced by putting

$$\rho_{x,d} = f^0_{x,d} = \begin{cases} 1, & \text{if } g_R(x + d) = g_L(x) \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$


Although it has not been possible to this date to find a rigorous proof for the
convergence of this algorithm, numerical experiments and a probabilistic analysis
(Marr et al., 1978) show that the synchronous network defined above will converge to
reasonably good solutions for random dot stereograms portraying piecewise constant
surfaces. However, this scheme has several problems (although some modifications
to get around them are suggested in Marr and Poggio, 1976 and in Marr et al.,
1978):


In  the  first  place,  the  quality  of  the  results  degrades  very  fast  as  the  density  of 
the  tokens  in  the  stereogram  decreases.  Besides,  it  is  not  clear  how  to  extend  this 
formulation  to  the  more  interesting  cases  of  slowly  varying  disparities,  and  different 
types  of  tokens  placed  in  points  that  do  not  correspond  to  a  regular  lattice. 


5.1.  Asynchronous  Algorithms. 


We now consider algorithms of the form (6) that operate asynchronously. In
this case, it has been shown (Hopfield, 1982) that if we choose the parameters in
such a way that $p_i$ is never 0 (this can be done, for example, if $w_{ij}$ and $\rho_i$ are
integers, by giving $\theta$ a non-integer value), the "Energy" function:

$$E(f) = -\frac{1}{2} \sum_{i,j} w_{ij} f_i f_j - \sum_i f_i (\rho_i - \theta) \qquad (8)$$

will decrease monotonically at every global iteration of the asynchronous algorithm
in which the state of every element is updated, unless the network is at a fixed
point.
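
The behaviour of the energy (8) under asynchronous updating can be checked numerically on a small network. The sketch below is illustrative only: the weights are random rather than the stereo weights defined above, and it assumes symmetric integer weights with zero self-connections, integer external inputs, and a non-integer threshold, so that the net input is never 0.

```python
import random


def energy(f, w, rho, theta):
    """E(f) = -1/2 sum_ij w_ij f_i f_j - sum_i f_i (rho_i - theta)."""
    n = len(f)
    pair = sum(w[i][j] * f[i] * f[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(f[i] * (rho[i] - theta) for i in range(n))


def global_iteration(f, w, rho, theta):
    """Update every element once, in index order, always using the newest states."""
    for i in range(len(f)):
        p = sum(w[i][j] * f[j] for j in range(len(f))) + rho[i] - theta
        f[i] = 1 if p > 0 else 0
    return f


rng = random.Random(1)
n = 12
w = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = rng.choice([-2, 0, 1])   # symmetric, zero diagonal
rho = [rng.choice([0, 1]) for _ in range(n)]
theta = 0.5                                          # non-integer threshold
f = [rng.choice([0, 1]) for _ in range(n)]

energies = [energy(f, w, rho, theta)]
for _ in range(10):
    f = global_iteration(f, w, rho, theta)
    energies.append(energy(f, w, rho, theta))
```

In this setting each single-element update changes the energy by minus the change of state times the net input, which is never positive, so the recorded sequence of energies cannot increase.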


It  is  interesting  to  note  that  with  the  parameter  definitions  given  above  for  the 
stereo  problem,  the  term 


0fxyd  X/ 

1  V€N, 


in  (8)  will  be  negative  only  if  all  the  spatial  neighbors  of  the  cell  x  on  the  same 
disparity layer are "on", and therefore corresponds to a smoothness constraint. The
term 


corresponds  to  the  compatibility  with  the  observations,  and  the  remaining  terms: 


fr,,l  0  +  ~  Y1  f\lA' 
£  M„d 


may  be  considered  as  an  implementation  of  the  uniqueness  constraint,  since  their 
minimization  requires  that  we  have  as  few  "on"  cells  as  possible,  and  it  penalizes 
explicitly  the  local  non-uniqueness  of  the  solution. 

5.2.  Experimental  Performance. 

To  study  the  performance  of  these  algorithms,  we  implemented  a  simulator 
of  both  the  synchronous  and  asynchronous  networks.  The  "stimulus"  used  for  the 
set  of  experiments  performed,  was  a  random  dot  stereogram  portraying  a  square  of 
21  X  21  elements  floating  at  disparity  -2  in  front  of  a  flat  background  at  disparity  0. 
Figure  24  shows  this  stereogram  and  the  fixed  points  obtained  by  the  synchronous 
and  asynchronous  algorithms. 

In  both  cases,  the  behaviour  of  the  algorithm  shows  two  distinct  phases:  In  the 
first  iteration,  most  of  the  elements  that  are  "on"  on  the  wrong  layers  (and  some  on 
the correct ones) are turned "off" (see figure 24-b). As a result of this, at succeeding
iterations,  the  probability  of  having  a  cluster  capable  of  growing  is  relatively  high 
for  the  correct  regions,  which  begin  to  fill  in,  and  very  small  for  the  wrong  ones, 
for  which  the  remaining  "on"  cells  are  turned  "off". 

As a consequence of this form of operation, the precise shape of the boundaries between
regions will depend on the exact shape and location of the random clusters that are
formed after the first iteration on the correct layers. Also, it is easy to see that the
form of the inhibitory neighbourhood (see Marr and Poggio (1976)) causes the cells
lying on wrong layers along a narrow band near the edges of the background to be on
the average less inhibited by the "on" elements in the correct layers (which in turn
are less stimulated) than the interior points, thus making the formation of wrong
stable clusters more likely in these regions. This effect is more pronounced in the
asynchronous case, since a wrong cell that is left "on" can increase the excitation
of a neighbouring one on the same global iteration, increasing the likelihood of a
stable cluster, whereas in the synchronous case, all the cells of the cluster must be
left "on" at the same time.

Figure 24. (a) Random dot stereogram portraying a 21 x 21 square at disparity -2. (b)
State of the network after one iteration of the synchronous algorithm. (c) Fixed point for the
Synchronous Algorithm. (d) Fixed point for the Asynchronous Algorithm.

For the values used for the parameters ($\epsilon = 2$, $\theta = 3.5$; see Marr and Poggio,
1976) the energy defined in (8) decreases monotonically at each global iteration
of the asynchronous network, and thus, it converges to a configuration that is a
local minimizer of this function. The correct solution will also correspond to a
(different) local minimum; it is interesting to note, however, that in general it will
not be the global one. It is easy to show, for example, that if the random dot
stereogram portrays a region that has a ratio of area/perimeter less than a critical
value (for the current value of the parameters this critical ratio is $\approx 13$), this region
will not be distinguished from the background in the configuration that globally
minimizes the energy. This means that the use of simulated annealing to minimize
(8) will not necessarily improve the solution; however, we have found that after the
deterministic algorithm has converged, a few iterations of the Metropolis algorithm at
a moderate temperature ($\approx 1$) may be very effective for removing the clusters at
wrong layers. This is illustrated in figure 25.

6.  Winner-Takes-All  (WTA)  Networks. 


Linear threshold networks are not the only form of local implementation of the
constraints generated by the probabilistic formulation of our problem. A different
possibility is to associate with each column $\{f_{x,d},\ d \in Q\}$ a binary "Winner-Takes-All"
synchronous network:


The  input  u(x,  d)  to  each  cell  corresponds  to  the  excitatory  input  in  the  linear 
threshold  case,  that  is,  to  the  local  implementation  of  the  smoothness  constraints 
and  the  compatibility  with  the  observations. 

Figure 25. (a) Fixed point at T = 0. (b) State after 4 iterations at T = 1. (c) Fixed point at
T = 0 with (b) as initial state.

The inhibitory terms (the uniqueness constraint) are implemented in the form
of a WTA mechanism. The output (the new value of $f_{x,d}$) is given by:

$$f_{x,d} = \begin{cases} 1, & \text{if } u(x, d) = \max_{d' \in Q} u(x, d') \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

This means that $f_{x,d}$ will be "on" at time $t + 1$ only if it is maximally stimulated
with respect to all the other elements in the same column at time $t$, and if it is
"compatible enough" with the observations (see figure 26).

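One synchronous iteration of a rule of the form (9) can be sketched as follows. The example uses a one-dimensional lattice for brevity and takes the input u(x, d) to be the number of "on" neighbours in the same disparity layer plus the observation term; this particular form of u, the absence of any additional compatibility threshold, and all names are assumptions made for the illustration.

```python
def wta_iteration(f, rho, neighbors):
    """One synchronous winner-takes-all step.

    f[x][d] and rho[x][d] take values in {0, 1}; neighbors[x] lists the sites in N_x.
    """
    n_sites, n_disp = len(f), len(f[0])
    new_f = [[0] * n_disp for _ in range(n_sites)]
    for x in range(n_sites):
        # Assumed excitatory input: "on" neighbours in the layer plus observation term.
        u = [sum(f[y][d] for y in neighbors[x]) + rho[x][d] for d in range(n_disp)]
        top = max(u)
        for d in range(n_disp):
            new_f[x][d] = 1 if u[d] == top else 0   # ties leave several cells "on"
    return new_f


# Six sites, two disparity layers; layer 0 is correct, site 2 also matches in layer 1.
rho = [[1, 0], [1, 0], [1, 1], [1, 0], [1, 0], [1, 0]]
neighbors = [[y for y in (x - 1, x + 1) if 0 <= y < 6] for x in range(6)]
state = wta_iteration([row[:] for row in rho], rho, neighbors)
```

In this toy example the accidental match at site 2 loses against the better supported cell in layer 0 after a single iteration.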

This  design  has  several  advantages  : 

1.  For  dense  stereograms,  we  will  show  that  it  converges  to  the  correct  solution 
in  a  small  number  of  iterations. 

2. For sparse stereograms, the algorithm will give, with high probability, the
correct disparity at every location in which a matching token is present.

3. It exhibits a good performance with natural images portraying piecewise
constant surfaces.

4. It is not necessary to process the whole domain $\Omega$ at the same time; a
complete representation may be built up by defining local networks corresponding
to overlapping subregions that cover $\Omega$. This feature enables the algorithm to process
arbitrarily large images.

5.  It  can  be  extended  in  such  a  way  that  it  can  handle  more  complex  situations, 
such  as  transparent  and  piecewise  smooth  surfaces. 

Qualitatively,  this  improved  performance  can  be  explained  as  follows: 

Unlike  the  linear  threshold  design,  in  the  first  iteration  the  WTA  algorithm  will 
only turn "off" cells that do not lie in the correct disparity layers. This will cause
the  cells  that  lie  at  the  boundaries  of  clusters  at  the  wrong  layers  to  lose,  in  the 
subsequent  iterations,  against  the  corresponding  strongly  stimulated  cells  that  lie  in 
the  interior  of  the  "correct"  regions.  This  will  result  in  a  progressive  shrinking  of 
the  wrong  clusters,  and  will  end  up  with  their  disappearance. 

This  results  in  a  faster  convergence,  since  the  size  of  the  clusters  that  have  to 
be  killed  is  in  general  smaller  than  the  size  of  the  regions  that  the  linear  threshold 
algorithm  has  to  fill  in.  Also,  the  boundaries  between  constant  disparity  regions  will 
be  more  accurately  localized. 

The  only  situation  in  which  this  behavior  will  not  take  place,  is  when  there  is 
a  significant  overlap  (due  to  accidental  correlations  in  the  images)  between  regions 
lying  at  different  depths.  In  this  case,  the  algorithm  will  not  be  able  to  solve  the 
ambiguity  correctly  based  only  on  smoothness  considerations,  and  it  will  locate  the 
boundary  at  a  position,  within  the  region  of  overlap,  which  will  depend  on  the 
detailed  shape  of  this  region.  Also,  the  solution  will  not  be  so  clean  in  this  case;  a 
few  cells,  corresponding  to  different  disparities  at  the  same  spatial  position,  may  be 
left  "on"  in  the  final  state  (limit  cycles  involving  some  of  these  few  cells  are  also 
possible). 

This  type  of  ambiguity  (accidental  overlap)  is  relatively  frequent  in  sparse 
stereograms.  However,  the  regions  of  overlap  are  typically  "blank"  regions  (i.e., 
without  tokens),  and  the  algorithm  will  give  the  correct  disparity  at  all  token 
locations. 

We  will  now  make  these  considerations  more  precise.  First,  we  will  need  some 
definitions. 


1. $\Omega$ will be defined as the set of points lying on a finite square lattice.

2. We will use a second-order MRF with Ising potentials as the prior model for the
disparity field. Therefore, for each $x \in \Omega$, we define its neighborhood as:

$N_x = \Omega \cap \{y : 0 < |x - y| < 2\}$    (10)

3. Given a region $R \subset \Omega$, we define the set of its interior points (with respect to
$N_x$), $I(R)$, as the set of points in $R$ all of whose neighbors also belong to $R$:

$I(R) = \{x \in R : |N_x \cap R| = |N_x|\}$


In  a  similar  way  we  define: 


$I^2(R) = I(I(R))$


and so on. We call the points in $R$ that are not interior, $x \in R - I(R)$, boundary
points of $R$. We will say that a region $R$ is connected if, given any two sites $i, j \in R$,
we can find a sequence of sites $\{i = i_0, \ldots, i_p = j\}$, with $i_k \in R$ for $k = 1, \ldots, p$,
such that $i_k \in N_{i_{k+1}}$ for $k = 0, \ldots, p - 1$.

4. Given a region $R \subset \Omega$, we define its diameter $D(R)$ (with respect to $N_x$) as the
smallest integer such that:

$I^{D(R)+1}(R) = \emptyset$

Alternatively,  if  we  define  an  algorithm  that  deletes  all  the  boundary  points  of  a 
region  at  every  step,  the  diameter  of  the  region  is  the  minimum  number  of  steps 
necessary  to  completely  delete  the  region. 


5. The initial state of the network will be given by:

$f^0_{x,d} = \begin{cases} 1, & \text{if } g_R(x + d) = g_L(x) \\ 0, & \text{otherwise} \end{cases}$    (11)


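6. The WTA algorithm for this problem will have the particular form:

$f_{x,d}(t + 1) = \begin{cases} 1, & \text{if } u_{x,d}(t) = \max_{d' \in Q} u_{x,d'}(t) \\ 0, & \text{otherwise} \end{cases}$

$u_{x,d}(t) = \alpha f^0_{x,d} + \sum_{y \in N_x} f_{y,d}(t)$    (12)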


7. We will assume that the set $\Omega$ can be covered by $M + 1$ non-overlapping regions:

$\Omega = R_1 \cup \cdots \cup R_M \cup O$

and that the correct solution (i.e., the way the stereogram was generated) consists in
assigning to every point in $R_i$ the depth $d_i$:

$f_{x,d_i} = 1 \quad \text{if } x \in R_i$

The set $O$ corresponds to the union of all the regions that are occluded in the left
image (see figure 23), and therefore, for every $x \in O$, any depth assignment will be
considered "correct".

8. Since we are assuming that the observations are perfect, the loading rules
guarantee that

$f^0_{x,d_i} = 1$ for every $x \in R_i$

However, in many cases we will also have:

$f^0_{x,d_j} = 1$ for some $x \in R_i$ and $d_j \neq d_i$

due to accidental correlations in the images. A connected set $W_j$ defined as:

$W_j = \{x : f^0_{x,d_j} = 1 \text{ and } x \in R_i \text{ for some } d_j \neq d_i\}$

will be called a wrong cluster on layer $j$ of $R_i$.

9. We will say that a stereogram has well defined boundaries if there are no large
wrong clusters overlapping the boundaries between adjacent regions. This means
that every non-occluded point must have at least as many "on" neighbors at time 0
on the correct layer as in any other layer, i.e., for every region $R_k$ and every point
$x \in R_k$:

$\sum_{y \in N_x} f^0_{y,d_k} \geq \sum_{y \in N_x} f^0_{y,d}$ for all $d \in Q$    (13)

10. A stereogram will be said to be unambiguous if for every region $R_i$ and every
wrong cluster $W_j$ there is at least one point $x \in W_j \cap R_i$ which has fewer "on"
neighbors at time 0 on the wrong layer $d_j$ than on the correct one $d_i$, i.e.,

$\sum_{y \in N_x} f^0_{y,d_j} < \sum_{y \in N_x} f^0_{y,d_i}$    (14)
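The loading rule (11) and the update (12) can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulator used in this work: it assumes that the support in (12) has the form $u_{x,d}(t) = \alpha f^0_{x,d} + \sum_{y \in N_x} f_{y,d}(t)$, that the disparity is a horizontal shift, that cells whose match falls outside the image start "off", and that a column with no support at all stays "off". The function names and the value alpha = 9 (any value larger than the 8 neighbors) are choices made for the example.

```python
def load_initial_state(left, right, disparities):
    """Loading rule (11): cell (x, d) starts "on" when g_R(x + d) == g_L(x).

    Cells whose match x + d falls outside the image are set to 0 here
    (an assumption made for this sketch).
    """
    rows, cols = len(left), len(left[0])
    f0 = {}
    for d in disparities:
        f0[d] = [[1 if 0 <= c + d < cols and right[r][c + d] == left[r][c] else 0
                  for c in range(cols)] for r in range(rows)]
    return f0


def neighbours(r, c, rows, cols):
    """Second-order neighbourhood (10): lattice sites y with 0 < |x - y| < 2."""
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols]


def wta_step(f, f0, alpha):
    """One synchronous update of (12) over every column of the network."""
    layers = list(f0)
    rows, cols = len(f0[layers[0]]), len(f0[layers[0]][0])
    new = {d: [[0] * cols for _ in range(rows)] for d in layers}
    for r in range(rows):
        for c in range(cols):
            nbrs = neighbours(r, c, rows, cols)
            u = {d: alpha * f0[d][r][c] + sum(f[d][i][j] for i, j in nbrs)
                 for d in layers}
            best = max(u.values())
            for d in layers:
                # Ties stay "on"; requiring best > 0 is an assumption that
                # keeps completely unsupported columns switched off.
                if best > 0 and u[d] == best:
                    new[d][r][c] = 1
    return new


def run_wta(f0, alpha=9.0, max_iters=50):
    """Iterate (12) from f0 until a fixed point; returns (state, iterations)."""
    f = f0
    for t in range(max_iters):
        nxt = wta_step(f, f0, alpha)
        if nxt == f:
            return f, t
        f = nxt
    return f, max_iters
```

For instance, a correct layer that is entirely "on" together with a single accidental match on another layer loses the accidental cell after one update, which is the behavior described qualitatively above.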

We  can  now  establish  the  following  result: 

Convergence Theorem: Given an unambiguous random dot stereogram with perfect
observations (0 error rate) portraying $M$ non-overlapping regions of constant depth
with well defined boundaries, the WTA algorithm (12) with $\alpha > 8$ will converge to
the correct solution in $K$ iterations, where $K$ is the diameter of the largest wrong
cluster in $\Omega$.

Proof: 

1) First, we note that condition (13) guarantees that all the cells on the correct layers
(which, by (11), are "on" at time 0) will remain "on" at time 1.

2) Condition (14) and the definition (12) guarantee that for every wrong cluster $W_j$
on every region $R_i$ there will be at least one point $x$ that will be turned "off" in the
first iteration. Then, for all points $y \in N_x \cap W_j \cap R_i$, we will have:

$\sum_{z \in N_y} f^1_{z,d_j} < \sum_{z \in N_y} f^1_{z,d_i}$

which implies that $f^2_{y,d_j} = 0$.

A recursive application of this reasoning establishes the theorem. ∎
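Since the theorem bounds the number of iterations by the diameter of the largest wrong cluster, it may help to see how a diameter can be computed by repeated deletion of boundary points. The sketch below is illustrative only: it reads the definition as "the smallest integer $D$ with $I^{D+1}(R) = \emptyset$", so that a single isolated point has diameter 0 and the number of deletion passes is $D + 1$. The function names, and the representation of the lattice and of regions as sets of (row, column) pairs, are assumptions of the example.

```python
def neighbourhood(x, lattice):
    """N_x of (10): the lattice sites y with 0 < |x - y| < 2."""
    r, c = x
    return {(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr or dc) and (r + dr, c + dc) in lattice}


def interior(region, lattice):
    """I(R): points of R all of whose neighbours also belong to R."""
    return {x for x in region if neighbourhood(x, lattice) <= region}


def erosion_passes(region, lattice):
    """Number of boundary-deleting passes needed to empty the region."""
    current, passes = set(region), 0
    while current:
        eroded = interior(current, lattice)
        if eroded == current:
            raise ValueError("region has no boundary points and never erodes")
        current, passes = eroded, passes + 1
    return passes


def diameter(region, lattice):
    """D(R) read as the smallest integer with I^(D+1)(R) empty."""
    if not region:
        raise ValueError("diameter is undefined for an empty region")
    return erosion_passes(region, lattice) - 1
```

Under this reading, a 3 x 3 block placed away from the lattice border has diameter 1 and a 5 x 5 block has diameter 2.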

Remarks: 

1. For occluded regions, there will be no large clusters of "on" cells in any layer of
$f^0$, and since the form of (12) precludes the growth over regions with $f^0 = 0$, if
there are any isolated points for which $f^0_{x,d} = 1$, they will remain "on" in $f^*$ (the
fixed point of (12)); otherwise, $f^* = 0$ uniformly over these regions.


2. If the stereogram has ambiguous boundaries, we can still use this theorem to
guarantee the convergence of the WTA algorithm to the correct solution outside the
overlap regions. It is clear that if we define new non-overlapping regions $R_1, \ldots, R_M$
with unambiguous boundaries, and include the overlap areas in the set $O$, the
theorem will guarantee that we get the correct solution in the new regions. In the
overlap areas, the stable state of the network may include some leftover ambiguity
($f_{x,d} = 1$ for more than one $d$), and even limit cycles involving a few cells. However,
these problematic areas will be confined to layers of unit width along the portions
of the (final) boundaries that lie inside the overlap regions.

3. The probability of finding wrong clusters in a binary stereogram is related to the
probability of finding a repeated subsequence on a Bernoulli sequence of length
equal to the total number of disparity layers, and decreases exponentially with the
number of cells belonging to each of these clusters. For dense stereograms (generated
by a Bernoulli process with parameter $p = \frac{1}{2}$), the probability of finding a wrong
cluster that contains a square of $m$ cells per side can be bounded by

Pr(cluster) <

where $N_D$ is the number of disparity layers, and $|\Omega|$ is the total number of cells in
the lattice. On the other hand, a cluster of diameter $k$ must contain at least a square
of side $2k + 1$. Thus, if $N_D = 7$ and $|\Omega| = 64^2$, for example, we can guarantee that,
for dense stereograms, the algorithm will converge to the correct solution in less
than 3 iterations with probability > 0.99.

4. For sparse stereograms, wrong clusters involving only "blank" areas will be very
common, but those containing active tokens will be rare. This fact, together with
remark 2, means that, with high probability, the WTA algorithm will find the correct
disparity at all the sites that have active tokens. This has been confirmed by our
experiments.


5.  Algorithm  (12)  will  not  grow  regions  into  occluded  (uncorrelated)  areas. 
Psychophysical  experiments  show  that  these  areas  should  be  included  with  the 
adjacent  region  that  is  at  the  greatest  depth.  It  can  be  verified  that  an  algorithm 


such  as  the  following: 


fxAt  +  0  — 


U.  if  StfCiV,  fvA1)  >  2fx,d'(l)  fy,d'{ 0  .  d'  A  d 

[o,  otherwise 


with $f_{x,d}(0) = f^*_{x,d}$ (the fixed point of (12)), will converge to a solution in which
these regions are correctly filled in, provided there are no wrong clusters in the
occluded regions, and that each layer of constant $d$ is allowed to converge separately,
starting with $d = d_{\min} = \min(d \in Q)$.


6. Note that even when $(x_1, x_2) \in \Omega$, $(x_1 + d, x_2)$ may lie outside $\Omega$ and so, if we load
the network using (11), some cells near the boundaries of $\Omega$ may remain undefined,
and (12) may give incorrect results. Therefore, we implicitly assume the existence
of a larger region $\Omega_0 \supset \Omega$ such that for all $x \in \Omega$, $f^0_{y,d}$ is defined for $y \in N_x \cup \{x\}$
and $d \in Q$. Also, the operation of (12) should be understood in a modified sense,
so that $f_{x,d}(t) = f^0_{x,d}$ for all $x \in \Omega_0 - \Omega$, all $d \in Q$, and all $t$.


A useful corollary establishes that it is not necessary to process all of $\Omega$ at the same
time, but that a complete representation can be built up by defining local networks
corresponding to windows $S \subset \Omega$, provided that there is enough overlap between
them. In particular, we will show that if we load the local network $S$ in such a way
that its initial state coincides with the initial state of the complete network at those
cells, and if the algorithm operates only on the interior points of $S$, keeping the
state of the boundary points fixed, then the final state of the local network at these
interior points will correspond to the optimal solution:

Let $f^t_\Omega(x, d)$ and $f^t_S(x, d)$ be the state of the $(x, d)$ cell at time $t$ in the complete
and local network respectively. We have:

Corollary 1: Suppose the conditions of the convergence theorem hold in $\Omega$, and
consider a set $S \subset \Omega$ such that the stereogram is not completely ambiguous in
$S_1 = I(S)$ (i.e., condition (14) holds for every $x \in S_1$). Suppose that we load the
local network $f_S$ in such a way that for every $x \in S$, $f^0_S(x, d) = f^0_\Omega(x, d)$, for all
$d \in Q$.


Then, algorithm (12), modified in such a way that $f^t_S(x, d) = f^0_S(x, d)$ for all
$t$, all $x \in S - S_1$, and all $d \in Q$, will converge to a fixed point $f^*_S$ for which
$f^*_S(x, d) = f^*_\Omega(x, d)$ for all $x$ belonging to unoccluded regions inside $S_1$.


Proof: 


Consider a region $R$ of constant disparity $d$ such that $R' = R \cap S_1 \neq \emptyset$, and let
$B_1$ be the intersection of $R$ with the boundary of $S_1$. For every point $x \in R' - B_1$,
$f^1_S(x, d) = 1$, by the same arguments as in the convergence theorem. For $x \in B_1$,
$f^1_S(x, d) = 1$ too, since $f^0_S(y, d) = f^0_\Omega(y, d)$ for $y \in N_x$, and (13) holds in $\Omega$.
Therefore, for every $x \in R'$, $f^1_S(x, d) = 1$.

On the other hand, for any wrong cluster $W_{d'} \subset R'$ in layer $d' \neq d$, since the
stereogram is not completely ambiguous inside $S_1$, there will be at least one point
$x \in W_{d'}$ such that $f^1_S(x, d') = 0$. Reasoning as we did before, we have that for all
points $y \in N_x \cap W_{d'} \cap R'$:

$\sum_{z \in N_y} f^1_S(z, d') < \sum_{z \in N_y} f^1_S(z, d)$

which implies that $f^2_S(y, d') = 0$.

Applying this reasoning recursively, we get, for every $x \in R'$, that $f^*_S(x, d) = 1$, and
$f^*_S(x, d') = 0$, $d' \neq d$, which, together with the convergence theorem, completes the
proof. ∎

Note that $S - S_1$ defines the overlap that should exist among local windows, so that
the complete representation, defined by

$\Omega = \bigcup_i S_1^{(i)}$

is correctly formed.


6.1.  Numerical  Results. 


To test the performance of algorithm (12) with random dot stereograms, a
simulator was implemented on a Symbolics 3600 computer. Figure 27 shows the
fixed points corresponding to dense and sparse stereograms portraying a pyramid.
As predicted by the theory, the convergence to the correct solution is fast (less than
4 iterations) in both cases. In the case of the sparse stereogram, the boundaries are
slightly misplaced, but, as can be verified by direct inspection of the stereogram,
all the dots are correctly located. The fixed point corresponding to the synchronous
operation of (6) (obtained after 11 iterations) is also presented, for comparison. As
we can see, the WTA algorithm (12) converges much faster to a much more precise
result.

7. Reconstruction of Real Images.

To  apply  this  algorithm  to  the  processing  of  real  images,  there  are  some 
modifications  and  extensions  that  should  be  made. 

7.1.  Neighborhood  size. 

It is possible to increase the robustness of algorithm (12) with respect to the
presence of noise in the images by increasing the size of the excitatory neighborhood
(i.e., by postulating a more global MRF prior model) and decreasing the value of
the parameter $\alpha$. This increased robustness is traded off against a decrease in resolution:
small correct regions may be treated as "noise", and therefore disappear from the
solution. Also, the shape of the piecewise constant regions may be altered (corners
may be rounded and small concavities "filled in").

7.2.  Token  Selection. 

The  simple  rule  (11)  is  adequate  for  measuring  the  compatibility  with  the 
observations  in  the  case  of  a  synthetic  image  (such  as  a  random  dot  stereogram). 
However,  it  will  not  work  in  the  case  of  continuous-toned  images  of  real  objects. 
The  reasons  for  this  failure  are  manifold:  the  distribution  of  the  reflected  light 


Figure 27. (a) Dense stereogram (density = 0.4) portraying a pyramid. (b) Fixed point for
algorithm (12). (c) Sparse stereogram (density = 0.1) portraying a pyramid. (d) Fixed point for
algorithm (12). (e) Fixed point for the synchronous algorithm (6).


varies as the viewpoint is changed (particularly the specular component), and the
two retinas (cameras) may have different point spread functions, and be affected
by independent sources of noise. This means that the model for the observation
process given by equation (3) should be replaced by another that reflects the process
of formation of natural images in a more realistic way. The use of a better model
will cause the term $f^0_{x,d}$ in equation (12) to be replaced by a different compatibility
measure $\eta_{x,d}$, which is obtained by first preprocessing the right and left images using
an operator $T$ whose output should be, ideally, invariant under the changes in
viewpoint, optics, etc., and then computing a suitably defined distance $D$ between
the two processed images:

$\eta_{x,d} = D(T g_R(x + d), T g_L(x))$    (15)

(note that $\eta$ may be continuous-valued).

The new WTA algorithm will be:

$f_{x,d}(t + 1) = \begin{cases} 1, & \text{if } u_{x,d}(t) = \max_{r \in Q} u_{x,r}(t) \\ 0, & \text{otherwise} \end{cases}$

$u_{x,d}(t) = \alpha \eta_{x,d} + P_N(f^t, x, d)$    (16)


The operator $P_N$ is generated by the enlarged MRF model, and in general it will
represent a weighted average of the values of the field in the enlarged neighborhood:

$P_N(f, x, d) = \sum_{y \in N_x} c(|x - y|) f_{y,d}$    (17)

where $N_x$ is the extended neighborhood of $x$ and $c(\cdot)$ denotes a set of parameters
that depend only on the distance $|x - y|$, and are related to the prior MRF model
for the disparity. $f^0$ may be chosen as:

$f^0_{x,d} = \begin{cases} 1, & \text{if } \eta_{x,d} = \max_{r \in Q} \eta_{x,r} \\ 0, & \text{otherwise} \end{cases}$


The convergence of this modified algorithm to the correct solution can still be
guaranteed if condition (13) is replaced by the requirement that the cell corresponding
to the correct layer of every non-occluded point should be maximally stimulated at
time 0, with respect to the other cells in the same column, by neighbors belonging
to the same constant disparity region:

$\alpha \eta_{x,d_i} + P_N^{R_i}(f^0, x, d_i) \geq \alpha \eta_{x,d} + P_N(f^0, x, d)$    (18)

for every region $R_i$, every $x \in R_i$ and every $d \in Q$. $P_N^{R_i}$ is the operator $P_N$ restricted
to $R_i$:

$P_N^{R_i}(f, x, d) = \sum_{y \in N_x \cap R_i} c(|x - y|) f_{y,d}$

(this modification is necessary to cover the case in which a point near the boundary
of a constant disparity region is partially stimulated by a wrong cluster outside this
region which may disappear in succeeding iterations).

Condition (14), i.e., the requirement that every wrong cluster has fewer "on"
neighbors at time 0 on the wrong layer than on the correct one, can now be expressed
by requiring that for every region $R_i$ and every wrong cluster $W_j$ on layer $j$ of $R_i$,
there is at least one point $x \in R_i \cap W_j$ such that:

$P_N(f^0, x, d_j) < P_N(f^0, x, d_i)$    (19)


Under  these  conditions,  it  is  easy  to  use  the  same  arguments  of  the  proof  of 
the  convergence  theorem  to  verify  the  convergence  of  algorithm  (16).  It  should  be 
remarked  that  conditions  (18)  and  (19)  are  sufficient,  but  by  no  means  necessary; 
(16)  may  converge  to  the  correct  solution  even  if  they  are  violated  by  a  particular 
stereogram. 


The determination of the optimal operators $D$ and $T$ in equation (15) is a
difficult, and as yet unsolved, problem. One scheme that has often been used is
to define $T$ as a convolution operator whose kernel is the Laplacian of a Gaussian
function, and $D$ as:

$D(a, b) = \begin{cases} 1, & \text{if } ab > 0 \\ 0, & \text{otherwise} \end{cases}$

(see Marr and Poggio, 1979). The rationale for this choice is that the zero crossings
of the convolution with the Laplacian operator should pick the places where large
intensity changes occur in both images (i.e., it acts as an "edge detector"), while the
Gaussian kernel has the effect of smoothing out the "irrelevant" edges and filtering
out the noise. One difficulty, however, is that if the Gaussian mask is large enough
to produce the desired effect, it will also introduce errors in the localization of the
zero crossings of the convolved images, which will translate into errors in the depth
of the reconstructed surface (see Clark and Lawrence, 1985).

We have found that the normalized absolute value of the Laplacian of the
difference between left and right images:

$\eta_{x,d} = \frac{-v(x, d) + \max_{r \in Q} v(x, r)}{\max_{r \in Q} v(x, r) - \min_{r \in Q} v(x, r)}$

with

$v(x, d) = |\nabla^2 (g_R(x + d) - g_L(x))|$    (20)

has  relatively  good  experimental  behavior,  but  clearly,  much  more  research  is 
needed  in  this  area. 
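The compatibility measure (20) can be illustrated with a short Python sketch. It is a sketch under several assumptions that are not part of the text: the Laplacian is discretized with the usual 5-point stencil, the disparity is a horizontal shift, pixels outside the image replicate the nearest border pixel, and a column in which every disparity gives the same $v$ is assigned $\eta = 1$ everywhere. All function names are illustrative.

```python
def difference_image(left, right, d):
    """g_R(x + d) - g_L(x) for a horizontal shift d; columns are clamped at
    the image border (an assumption of this sketch)."""
    cols = len(left[0])
    return [[right[r][min(max(c + d, 0), cols - 1)] - left[r][c]
             for c in range(cols)] for r in range(len(left))]


def abs_laplacian(img, r, c):
    """|Laplacian| at (r, c) with the 5-point stencil (assumed discretisation);
    out-of-range pixels replicate the nearest edge pixel."""
    rows, cols = len(img), len(img[0])

    def px(i, j):
        return img[min(max(i, 0), rows - 1)][min(max(j, 0), cols - 1)]

    return abs(px(r - 1, c) + px(r + 1, c) + px(r, c - 1) + px(r, c + 1)
               - 4 * px(r, c))


def compatibility(left, right, r, c, disparities):
    """eta_{x,d} of (20) for every d in the column at x = (r, c)."""
    v = {d: abs_laplacian(difference_image(left, right, d), r, c)
         for d in disparities}
    hi, lo = max(v.values()), min(v.values())
    if hi == lo:
        # No disparity is preferred; returning 1 everywhere is an assumption.
        return {d: 1.0 for d in disparities}
    return {d: (hi - v[d]) / (hi - lo) for d in disparities}
```

With this normalization the best-matching disparity in each column receives $\eta = 1$ and the worst receives $\eta = 0$, so the values can be used directly in (16).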

It is important to note that the definition of $\eta$ will affect the performance of the
WTA algorithm, since it will determine the extent to which conditions (18) and (19)
hold in the initial state; the structure of the WTA network, however, is independent
of the choice of $\eta$, so that the experimentation with different definitions can be
done very efficiently.

7.3.  Uniqueness  Constraint. 

The  definitions  (12)  and  (16)  imply  the  enforcement  of  the  constraint: 

"Each  point  in  the  left  image  should  be  matched  by  only  one  point  in  the  right 
image". 

That  is  to  say,  we  are  enforcing  the  uniqueness  constraint  along  the  left  eye 
line  of  sight.  It  is  also  possible  to  include  explicitly  the  corresponding  constraint  for 


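the right eye (as in Marr and Poggio, 1976). This is done by replacing (16) (or (12))
with:

$f_{x,d}(t + 1) = \begin{cases} 1, & \text{if } u_{x,d}(t) = \max_{r \in Q} u_{x,r}(t) \text{ and } u_{x,d}(t) = \max_{k : d + k \in Q} u_{x-k,d+k}(t) \\ 0, & \text{otherwise} \end{cases}$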

For  perfect  observations,  this  additional  constraint  is  redundant.  If  noise  or  other 
distortions  are  present,  however,  this  scheme  will  have  better  performance,  since  the 
disparity  of  "doubtful"  points  will  be  left  unassigned  (the  corresponding  values  of 
the  disparity  in  these  locations  may  be  determined  after  convergence  by  the  robust 
surface  reconstruction  techniques  described  in  chapter  5). 
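A one-dimensional Python sketch of this two-sided winner selection is given below. It is only an illustration, and it rests on an assumption about the geometry: since (11) compares $g_R(x + d)$ with $g_L(x)$, the cells that compete along a right-eye line of sight are taken to be those that share the same value of $x + d$. The function name and the dictionary representation of the supports $u_{x,d}$ are choices made for the example.

```python
def two_sided_winners(u):
    """Select cells that win both along the left and the right line of sight.

    u maps (x, d) to the support u_{x,d}. A cell is switched "on" only if it
    is maximal among the cells with the same x (left eye) and among the
    cells with the same x + d (right eye, an assumed geometry).
    """
    best_left, best_right = {}, {}
    for (x, d), val in u.items():
        best_left[x] = max(best_left.get(x, val), val)
        best_right[x + d] = max(best_right.get(x + d, val), val)
    return {(x, d): int(val == best_left[x] and val == best_right[x + d])
            for (x, d), val in u.items()}
```

When the best cell of a column is beaten along the right line of sight by a cell of another column, no cell of that column is selected, which corresponds to the "doubtful" points mentioned above being left unassigned.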

As  an  example  of  the  application  of  this  technique,  the  processing  of  a  stereo 
pair  of  aerial  photographs  is  illustrated  in  figure  28  (this  stereo  pair  is  the  same  that 
was  used  in  chapter  5;  see  figure  19).  Although  it  is  difficult  to  assess  objectively 
the  performance  of  an  algorithm  on  this  type  of  images,  the  quality  of  these  results 
seems  at  least  equivalent  to  that  obtained  by  state-of-the-art  systems  (see  Grimson, 
1984). 

7.4.  Piecewise  Smooth  Surfaces. 

The  WTA  scheme  can  also  be  applied  to  reconstruct  disparity  surfaces  that 
are  piecewise  smooth.  To  do  this,  it  is  only  necessary  to  modify  the  definition  of 
the  operator  PN  (equation  (16)),  so  that  cells  at  nearby  depths  are  also  taken  into 
account.  Notice  that,  in  order  to  be  consistent  with  the  WTA  mechanism,  only  the 
maximum  contribution  for  any  given  column  should  be  considered.  The  modified 
operator  is: 

$P_N(f, x, d) = \sum_{y \in N_x} \max_{r \in N_d} \{c(|x - y|, |d - r|) f_{y,r}\}$    (21)

where $c(\cdot, \cdot)$ is some fixed decreasing function of its arguments, and $N_d$ is a disparity
neighborhood defined as the intersection of a closed interval with the set of allowable
disparities:


$N_d = [d - p, d + p] \cap Q$


where $p$ is a positive constant.

The sufficient conditions for the convergence of the modified algorithm, namely,
that the stereogram should be unambiguous and have well defined boundaries with
respect to the modified operator $P_N$, can also be expressed in the form given
by equations (18) and (19), but now a wrong cluster $W_j$ should be defined as a
connected region on the disparity layer $d_j$ such that $f^0_{x,d_j} = 1$, and $d_j \neq d^*(x)$ for
all $x \in W_j$, where $d^*(x)$ is the true disparity at point $x$. The proof of the convergence
theorem is straightforward, but the interpretation of these conditions is not obvious,
and in practice, they are very difficult to verify, so that at this point, the performance
of this algorithm should be assessed experimentally.

Prazdny (1984) (see also Pollard et al., (1984)) has obtained good results for
the reconstruction of piecewise smooth and "transparent" surfaces (i.e., stereograms
portraying sets of small interspersed patches that belong to two smooth surfaces,
one in front of the other) using an operator of the form:

$P_N(f, x, d) = \sum_{y \in N_x} \sum_{r \in N_d} c(|x - y|, |d - r|) f_{y,r}$

We  believe  that  the  use  of  (21)  should  improve  the  performance  in  these  cases. 
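The difference between the summed operator and operator (21) can be seen in a small one-dimensional Python sketch. It is illustrative only: the field is stored as a dictionary from (site, disparity) to 0 or 1, the neighborhood $N_x$ is passed in explicitly, and the weight function used in the usage note is a hypothetical choice, not one taken from the text.

```python
def p_sum(f, x, d, nbrs, disparities, c, p):
    """Summed form: every active cell in N_x x N_d contributes."""
    return sum(c(abs(x - y), abs(d - r)) * f.get((y, r), 0)
               for y in nbrs for r in disparities if abs(d - r) <= p)


def p_max(f, x, d, nbrs, disparities, c, p):
    """Operator (21): in each neighbouring column only the largest
    contribution over the disparity neighbourhood N_d is counted."""
    return sum(max(c(abs(x - y), abs(d - r)) * f.get((y, r), 0)
                   for r in disparities if abs(d - r) <= p)
               for y in nbrs)


def example_weight(dist, ddisp):
    """Hypothetical decreasing weight c(., .), used only for illustration."""
    return 1.0 / ((1 + dist) * (1 + ddisp))
```

For example, if a neighboring column has two active cells at adjacent disparities, `p_sum` counts both of them while `p_max` counts only the stronger one, which is the behavior that keeps (21) consistent with the WTA mechanism.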

8.  Discussion 

In this chapter we have studied a class of reconstruction problems that arise
when  the  conditional  distribution  of  the  observations  is  a  multimodal  function, 
which  causes  the  solution  to  remain  ambiguous,  even  for  arbitrarily  high  signal  to 
noise  ratio.  We  identified  the  signal  matching  problem  as  one  of  the  most  important 
instances  of  this  class,  and  gave  a  probabilistic  formulation  for  it  using  a  MRF 
model  to  model  the  disparity  surface,  so  that  the  optimal  estimation  algorithms 
derived  in  chapter  3  could  be  applied. 

We  then  presented  a  different  approach  to  the  solution  of  the  problem  in 
which  the  constraints  derived  both  from  the  prior  MRF  model  for  the  disparity 
field and from the observations are implemented directly as excitatory connections


on  a  three-dimensional  cooperative  network  of  processors  (or  "cells")  with  binary 
state  space.  The  steady  state  of  this  network  can  be  unambiguously  interpreted  as  a 
disparity  surface  only  if  there  is  exactly  one  processor  in  each  column  whose  state 
is  equal  to  1.  This  imposes  a  uniqueness  constraint  which  can  be  enforced  either 
by introducing inhibitory linear connections, or by a "Winner-take-all" mechanism
that operates within each column. We showed that, for high signal to noise ratio, it
is  possible  to  define  precise  sufficient  conditions  (which  are  usually  met  in  the  case 
of  synthetic  images)  for  the  convergence  of  the  state  of  this  "WTA"  network  to  the 
correct  solution  in  a  small  number  of  iterations. 

The experimental performance of this algorithm with random dot stereograms
is excellent; it produces accurate reconstructions in a very short time (in less than 5
iterations). In the case of the reconstruction of real objects from stereoscopic
photographs, this algorithm, with some modifications, produces results
comparable with those obtained by more complicated schemes that are considered
"state of the art", and it has the advantage of being directly implementable in
parallel hardware.

It should be noted that the performance of the stereoscopic vision of human
beings on similar data is still dramatically superior to that of this, or any other
existing artificial system. Some issues that should be addressed for the development
of more effective algorithms are the following:

(i) More realistic models for the observation process that take into account
the nature of the relative distortions of the left and right images should be
constructed. This should lead to the definition of optimal combinations of
tokens for the matching process. The precise nature of the optical system
used (which may cause problems like non-horizontal epipolar lines, vertical
disparities, etc.) should also be taken into account.

(ii) The use of more sophisticated prior models for the disparity field,
including a coupled line field as described in chapter 5, should be
investigated.

(iii)  Since  the  intensity  edges  and  the  regions  of  uniform  intensity  (or  uniform 
texture)  of  the  images  are  natural  candidates  for  becoming  stereo  matching 
tokens,  and  the  location  of  depth  and  intensity  (or  texture)  edges  is  likely 
to  be  correlated  in  a  natural  scene,  the  integration  of  edge  detection; 


image  segmentation;  stereo  matching  and  surface  reconstruction  into  a 
single  estimation  process  should  produce  very  good  results.  The  Bayesian 
approach,  and  the  use  of  coupled  MRF  models  for  describing  surfaces 
and  edges  that  we  have  presented  in  this  thesis  should  provide  a  unified 
framework  for  performing  this  integration.  We  discuss  this  point  further 
in  the  next  chapter. 


Appendix 6.A


DYNAMIC  PROGRAMMING  APPROACH  TO  SIGNAL  MATCHING 


Consider  the  one-dimensional  version  of  the  signal  matching  problem  described 
in  section  2.  To  compute  the  MAP  estimate,  we  need  to  find  the  global  minimum 
of: 

u<‘ (4  s)  =  ~  E  m.  <0  +  5  £  M<1  In  2  + 

+  |  O1  “  M'Wisdi)  -  9r(*  +  4.-)}  (1) 

(i.e., equation (4)). The use of the dynamic programming algorithm described in
appendix 4.B is complicated by the fact that, given the boundaries between
regions of constant disparity, the optimal estimate for $d$ in the interval $(L_i, L_{i+1}]$
depends on the estimate on $(L_{i-1}, L_i]$, since this last choice determines the extent
of the occluded region.

However,  if  we  assume  that  the  size  of  the  regions  of  constant  disparity  is 
relatively  large  compared  with  the  size  of  the  occluded  areas  (as  it  normally  happens 
in  most  practical  cases),  we  can  estimate  d  given  Ln  using  the  formula: 

$\hat{d}((L_i, L_{i+1}]) = \{k : \sum_{j=L_i+1}^{L_{i+1}} \delta(g_L(j) - g_R(j + k)) \leq \sum_{j=L_i+1}^{L_{i+1}} \delta(g_L(j) - g_R(j + l)), \text{ for all } l \in Q\}$    (2)

Defining:

$G_{k,l} = \sum_{i=k+1}^{l} \delta(g_L(i) - g_R(i + \hat{d}((k, l])))$    (3)

and 


A  A 

Ai  =  max(0,  i  —  di ) 


(note that $\Delta_i$ corresponds to the length of the occluded region when a change in
the estimated disparity occurs), we get that

Ur(i,9)  =  ^+U{L) 

with 

$U(L) = G_{1,L_1} + \Delta_1 \ln 2 + G_{L_1+\Delta_1+1,L_2} + \Delta_2 \ln 2 + \ldots + G_{L_n+\Delta_n+1,N}$

We can now perform the global minimization of $U$ using the dynamic programming
scheme of appendix 4.B. In this case, however, it is better to use "forward"
recursions (in the sense that now $F_j(k)$ will represent the cost associated with
putting $j$ boundaries, in the best possible locations, in the interval $[1, k]$), because
occlusion, as we have defined it, always takes place from left to right. We have
then:

$F_0(k) = G_{1,k}$
$L_0(k) = 1$

$F_{j+1}(k) = \inf_L \{G_{L+\Delta_j+1,k} + F_j(L) + \Delta_j \ln 2\}$

$L_{j+1}(k) = \{L : G_{L+\Delta_j+1,k} + F_j(L) + \Delta_j \ln 2 = F_{j+1}(k)\}$

The optimal location of the boundaries, for any given $n$, is:

$S_n = \{L_n(N), L_{n-1}(L_n(N)), \ldots, L_1(L_2(\ldots(L_n(N))\ldots))\}$

The  optimal  configuration  is  computed  using  (2),  and  the  corresponding  energy, 
using  (1). 

Note that as the size of the regions of constant disparity decreases, (2) may not
be well defined (the optimal estimate $\hat{d}$ may not be unique) and a more complex
optimization procedure may be required.


Chapter  7 


CONCLUSIONS 


In this thesis we have presented a probabilistic approach to the solution of
a class of perceptual problems. We showed that these problems can be reduced
to the reconstruction of a function on a finite lattice from a set of degraded
observations, and derived the Bayesian estimators that provide an optimal solution.
We have also developed efficient distributed algorithms for the computation of these
estimates, and discussed their implementation in different kinds of hardware. To
demonstrate the generality and practical value of this approach, we studied in detail
several applications: the segmentation of noise-corrupted images; the formation of
perceptual clusters; the reconstruction of piecewise smooth surfaces from sparse data
and the reconstruction of depth from stereoscopic measurements.

This  methodology  also  permits,  in  principle,  the  incorporation  of  more  than  one 
modality  of  observations  into  a  single  estimation  process,  as  well  as  the  simultaneous 
estimation  of  several  related  functions  from  the  same  data  set.  This  makes  one  hope 
that  this  framework  could  be  useful  in  the  solution  of  difficult  problems  that  require 
such  an  integrated  approach.  We  mention  two  examples: 

1.  We  mentioned  in  chapter  6  that  the  stereo  matching  problem  in  real  situations 
has  not  been  solved  yet  in  a  satisfactory  way.  The  same  can  be  said  of  other  related 
perceptual  problems  such  as:  edge  detection;  image  segmentation;  the  recovery 
of the shape of an object from a single two-dimensional image (the "shape from
shading"  problem),  and  the  segmentation  of  a  scene  into  distinct  objects,  as  well 
as  the  recovery  of  their  three-dimensional  structure  from  the  analysis  of  images 
formed  at  successive  instants  of  time  (the  "structure  from  motion"  problem).  All 
these  problems  are  obviously  related,  and  it  is  intuitively  clear  that  the  individual 
solutions  that  can  be  obtained  should  improve  if  the  mutual  constraints  that  the 
solution of each individual problem imposes on the others were taken into account.
Thus,  the  presence  of  a  brightness  edge  should  increase  the  likelihood  of  a  depth 
edge, and vice versa; the depth estimated from stereo should be compatible with
the  shape  derived  from  shading;  points  belonging  to  the  same  region  in  an  image 
should  move  together,  etc.  We  believe  that  these  constraints  can  be  incorporated 
in  the  potential  functions  of  the  corresponding  MRF  models  (in  particular,  of  the 
coupled  fields  that  represent  the  "lines"  or  edges  in  each  case;  see  chapter  5). 

2. The processing and interpretation of geophysical information (as is done, for
example, in oil prospecting) attempts to reconstruct subterranean geological structures
from information provided by a set of qualitatively different measurements, such as
those obtained by gravimetric and magnetometric surveying; reflection seismology;
measurements  of  physical  properties  taken  vertically  along  bore  holes  ("well  logs"), 
etc.  Since  all  these  measurements  are  obtained  independently,  their  joint  conditional 
probabilities  can  be  easily  determined,  and  since  all  of  them  refer  to  the  same 
physical  structures,  their  processing  can,  in  principle,  be  integrated  into  a  single 
estimation  process,  which  should  greatly  increase  the  reliability  of  the  results. 

The  above  considerations  may  be  taken  one  step  further.  Ultimately,  the  results 
one  is  interested  in  are  not  only  the  quantitative  reconstruction  of  some  surfaces,  but 
the  symbolic  description  of  the  scene  in  terms  of  functional  structures  or  "objects". 
On the other hand, the prior knowledge about the occurrence of a particular object
or class of objects might greatly simplify the tasks of the "low level" processors
(for example, a letter recognition algorithm should greatly benefit from the use
of context, given the probabilities of occurrence of certain letter combinations or
words).  The  Bayesian  approach  provides  a  common  "language"  that  may  allow 
these  low-level  and  high-level  (or  symbolic)  processes  to  communicate  and  mutually 
enhance  their  performance. 

As a simple example of this situation, suppose that we are interested in finding
a symbolic description of a binary pattern f in terms of a set of geometric objects
(such as squares, triangles, etc.) that are characterized by some parameters (such
as position, orientation, size, etc.) for whose values we have some prior probability
knowledge.

Given a description D, i.e., a list of objects with a set of particular values for
their parameters, we can find a binary field q which corresponds to the boolean sum
of the indicator functions of the objects included in D:

$$q_i = \begin{cases} 1, & \text{if an object in } D \text{ covers pixel } i \\ 0, & \text{otherwise} \end{cases}$$

We can now write the joint prior distribution for the field f (which represents the
actual intensity of the noise-free image) and its description as:

$$P(f, D) = P(f \mid D)\, P(D)$$

To compute P(f | D), we assume that f is a first order MRF whose configuration
is biased by D:

$$P(f \mid D) = \frac{1}{Z} \exp\left[ -\frac{1}{T_0} \sum_{i,j} V(f_i, f_j) + \lambda \sum_i q_i f_i \right]$$

P(D) can be computed from the prior probabilities for the occurrence of each type
of object, and from the prior distributions for the values of the corresponding
parameters. Since the conditional distribution of the observations depends directly
only on f, the posterior distribution will be:


$$P(f, D \mid g) = \frac{P(g \mid f)\, P(f, D)}{P(g)}$$


where P(g) is a constant. From this expression we can compute the optimal estimates
for f and D using methods similar to the ones developed here.
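
As a rough illustration of the expression above, the following Python sketch evaluates the logarithm of the posterior P(f, D | g) up to additive constants. Everything specific in it is an assumption made for the example and is not taken from the text: the objects are axis-aligned squares given as hypothetical (row, col, size) triples, V is an Ising-type potential (-1 for equal neighbors, +1 otherwise), the observations are modeled as f passed through a binary symmetric channel with error rate eps, and the names inv_t0 and lam stand for 1/T_0 and the weight of the bias term. The normalizing constant Z of P(f | D) is ignored even though it depends on D in general, so the values are strictly comparable only across different f for a fixed D.

```python
import math


def indicator_field(description, height, width):
    """Boolean sum of the indicator functions of the objects in the description.

    Each object is a hypothetical axis-aligned square (row, col, size).
    """
    q = [[0] * width for _ in range(height)]
    for row, col, size in description:
        for r in range(max(0, row), min(height, row + size)):
            for c in range(max(0, col), min(width, col + size)):
                q[r][c] = 1
    return q


def log_posterior(f, g, description, log_prior_d, inv_t0=1.0, lam=1.0, eps=0.1):
    """Unnormalized log P(f, D | g); log Z and log P(g) are dropped (assumption)."""
    height, width = len(f), len(f[0])
    q = indicator_field(description, height, width)
    total = log_prior_d  # log P(D), supplied by the caller
    for r in range(height):
        for c in range(width):
            # Ising-type pair potential over right and down neighbors.
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < height and cc < width:
                    v = -1.0 if f[r][c] == f[rr][cc] else 1.0
                    total -= inv_t0 * v
            # Bias of the field towards the pixels covered by the description.
            total += lam * q[r][c] * f[r][c]
            # Binary symmetric channel likelihood log P(g_i | f_i) (assumption).
            total += math.log(1.0 - eps if g[r][c] == f[r][c] else eps)
    return total
```

With a single 2x2 square in a 4x4 image and observations equal to the covered pixels, the field that coincides with the square scores higher than the empty field when the coupling is weak (for instance inv_t0=0.3). With a strong coupling the uniform field can win instead, which is the usual trade-off between the smoothness prior and the data.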

We will now present a summary of our main results and a list of some interesting
open  technical  questions. 

1.  Summary  of  our  Main  Results. 

1.1. Optimal Bayesian Estimators.
Several researchers have used Bayes theory and MRF models for the restoration
of  piecewise  uniform  images.  It  has  been  implicitly  assumed  by  all  of  them  that  the 
maximization  of  the  posterior  probability  is  the  best  possible  performance  criterion. 
We  have  shown  that  it  is  possible  to  choose  other  criteria  that  are  better  adapted 
to  each  particular  problem,  and  have  derived  the  corresponding  optimal  estimators, 
which  not  only  improve  substantially  the  quality  of  the  results  (particularly  for 
low  signal  to  noise  ratios),  but  also  lead  to  more  efficient  and  better  behaved 
computational  schemes. 

1.2.  General  Monte  Carlo  Algorithms. 

We  have  shown  that  the  optimal  Bayesian  estimators  can  be  obtained  from 
the observation of the equilibrium behavior of an MRF (which in physical terms
corresponds to a ferromagnet subject to a spatially varying external magnetic field).
This  behavior  can  be  effectively  simulated  by  Monte  Carlo  procedures  which 
generate  a  regular  Markov  chain  with  an  invariant  Gibbs  measure. 

This  method  differs  from  "simulated  annealing"  (which  has  been  used  to 
approximate  the  MAP  estimator)  in  that  it  is  based  on  the  collection  of  statistics  of 
the evolution of the chain at a fixed temperature, while the latter attempts to find the
ground state of the coupled system by slowly decreasing the temperature. From a computational
viewpoint,  our  method  exhibits  a  faster  and  more  consistent  convergence  behavior. 
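
The fixed-temperature idea can be sketched in a few lines of Python for a one-dimensional binary field. This is only an illustration under assumptions that are not stated in this summary: an Ising-type pair potential with weight inv_t0 (standing for 1/T_0), a binary symmetric channel with error rate eps as the noise model, and a sequential Gibbs sampler. The function name mpm_estimate and all parameter names are hypothetical. The chain is run at a constant temperature, the number of visits of each site to the state 1 is accumulated after a burn-in period, and each site is assigned its most frequent state.

```python
import math
import random


def mpm_estimate(g, eps=0.1, inv_t0=0.7, sweeps=300, burn_in=50, seed=0):
    """Estimate a binary chain from noisy data g by collecting Gibbs statistics."""
    rng = random.Random(seed)
    n = len(g)
    f = list(g)      # start the chain at the observations
    ones = [0] * n   # how often each site was in state 1 after burn-in
    for sweep in range(sweeps):
        for i in range(n):
            energy = []
            for value in (0, 1):
                u = 0.0
                # Ising-type coupling with the left and right neighbors.
                for j in (i - 1, i + 1):
                    if 0 <= j < n:
                        u += inv_t0 * (-1.0 if f[j] == value else 1.0)
                # Data term: -log P(g_i | f_i = value) for the assumed channel.
                u -= math.log(1.0 - eps if value == g[i] else eps)
                energy.append(u)
            # Conditional probability of state 1 given the rest of the field.
            p_one = 1.0 / (1.0 + math.exp(energy[1] - energy[0]))
            f[i] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:
            for i in range(n):
                ones[i] += f[i]
    kept = sweeps - burn_in
    # Most frequent state of each site: an approximation of the marginal maximizer.
    return [1 if 2 * count > kept else 0 for count in ones]
```

No cooling schedule appears anywhere in the loop, which is the contrast with simulated annealing: the temperature stays fixed and only the empirical frequencies of the site states are used.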

1.3.  Parallel  Implementations. 

The  implementation  of  this  general  Monte  Carlo  procedure  in  parallel  hardware 
was  discussed.  We  proved  that  the  Gibbs  sampler  (but  not  the  Metropolis  or  Heat 
Bath  algorithms)  will  produce  consistent  results  in  this  case. 

1.4. Reconstruction of Piecewise Constant Functions.

The  problem  of  reconstructing  a  piecewise  constant  function  from  noisy 
(but  dense)  observations  was  formulated  in  probabilistic  terms,  and  the  form  of 
the  optimal  estimators  derived.  For  the  one-dimensional  case,  we  presented  a 
deterministic  algorithm  with  minimal  complexity  which  computes  (exactly)  the 
MAP estimate of binary fields. For the two-dimensional case, we presented a
method for improving the computational efficiency of the "Simulated Annealing"
scheme  for  approximating  the  MAP  estimator,  and  derived  a  fast  algorithm  for 
approximating  the  optimal  (MPM)  Bayesian  estimator. 
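
To make the one-dimensional case concrete, here is a minimal Python sketch that computes an exact MAP estimate of a binary chain in linear time with standard dynamic programming (the Viterbi recursion). It is not claimed to be the algorithm of the thesis, only a well-known technique that solves the same problem under an assumed energy: beta for every pair of neighbors that disagree plus alpha for every site that differs from its observation. The names map_binary_chain, alpha and beta are hypothetical.

```python
def map_binary_chain(g, alpha=1.0, beta=1.5):
    """Exact minimizer of beta * #(f_i != f_{i+1}) + alpha * #(f_i != g_i)."""
    n = len(g)
    if n == 0:
        return []
    # cost[v]: lowest energy of a prefix that ends in state v.
    cost = [alpha * (v != g[0]) for v in (0, 1)]
    back = []  # back[i - 1][v]: best state at site i - 1 given state v at site i
    for i in range(1, n):
        new_cost, pointers = [], []
        for v in (0, 1):
            candidates = [cost[u] + beta * (u != v) for u in (0, 1)]
            u_best = 0 if candidates[0] <= candidates[1] else 1
            new_cost.append(candidates[u_best] + alpha * (v != g[i]))
            pointers.append(u_best)
        cost = new_cost
        back.append(pointers)
    # Trace the optimal path backwards from the best final state.
    v = 0 if cost[0] <= cost[1] else 1
    f = [v]
    for pointers in reversed(back):
        v = pointers[v]
        f.append(v)
    f.reverse()
    return f
```

For example, with the default weights an isolated flipped sample such as [0, 0, 1, 0, 0] is smoothed away, while a genuine step such as [0, 0, 0, 1, 1, 1] is preserved, because removing the step would cost three data mismatches against a single boundary.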

We also presented a maximum likelihood procedure which, based on an analysis
of the residual ("innovations") process, permits the simultaneous estimation of the
field  and  the  parameters  of  the  system.  We  applied  this  technique  to  the  construction 
of  a  parameter-free  algorithm  for  the  reconstruction  of  arbitrary  binary  patterns. 

1.5.  Formation  of  Perceptual  Clusters. 

We  suggested  that  the  process  of  formation  of  perceptual  clusters  of  certain 
dot  patterns  can  be  modeled  in  terms  of  the  estimation  of  binary  images  corrupted 
by  multiplicative  noise,  and  illustrated  the  application  of  our  estimation  algorithm 
to  this  task. 

1.6.  Reconstruction  of  Piecewise  Continuous  Surfaces. 

The problem of simultaneously detecting the discontinuities and reconstructing
a piecewise smooth surface from sparse observations was cast in the Bayesian
framework. A model consisting of two coupled MRF's, one representing the
depth and the other the boundaries between continuous regions, was adapted to
our  problem.  Since  the  straightforward  use  of  the  general  Monte  Carlo  algorithm 
for  finding  the  optimal  estimate  is  computationally  unfeasible  in  this  case,  an 
approximation  (which  showed  an  excellent  experimental  performance  with  both 
synthetic and "real" data) was derived and implemented. We also developed, and
heuristically justified, a fast algorithm that produces results that are practically
indistinguishable  from  the  optimal  ones.  The  implementation  of  these  procedures 
in  digital  parallel  hardware,  as  well  as  in  hybrid  and  analog  networks  was  also 
discussed. 

1.7.  Signal  Matching. 

We  presented  a  class  of  problems  that  is  characterized  by  the  fact  that  the 
conditional probability distribution of the observations P(g | f) is multimodal (as a
function of f), which means that the solution remains ambiguous, even for arbitrarily
high signal to noise ratios. We studied a prototype problem of this class: the signal
matching  problem  (in  particular,  the  reconstruction  of  depth  from  stereoscopic  pairs 
of  images),  and  showed  that  it  is  possible,  in  principle,  to  find  the  solution  using  the 
general  estimation  procedures  that  we  have  developed  (although  the  computational 
cost  will  be  high  in  the  general  case).  We  also  presented  a  different  scheme,  which 
is  based  on  the  direct  implementation  of  the  local  constraints  (generated  by  the 
probabilistic  model)  in  a  highly  distributed  cooperative  network  of  a  particular 
form: a "Winner-Take-All" network, and showed that the state of this network
will  converge  to  the  correct  solution  in  a  few  iterations  (in  the  high  SNR  case).  The 
application  of  this  technique  to  the  reconstruction  of  the  depth  of  real  objects  from 
stereoscopic  photographs  was  discussed,  and  some  modifications  to  the  algorithm 
were  introduced,  which  permitted  us  to  produce  results  which  compare  favourably 
with  those  of  other  "state  of  the  art"  algorithms. 
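
The flavor of a winner-take-all matching scheme can be conveyed by a small Python sketch for one-dimensional signals. It is a simplified illustration and not the network described in chapter 6: the local compatibility test (exact equality of samples), the support window of radius `radius`, the synchronous update, and the names wta_match, max_disp and iterations are all assumptions made for this example. Each site holds one unit per candidate disparity; at every iteration a unit gathers support from units with the same disparity at neighboring sites, and only the best supported unit at each site stays active.

```python
def wta_match(left, right, max_disp, radius=1, iterations=3):
    """Return one disparity per sample of `left`, or None if nothing matches."""
    n = len(left)
    disps = range(max_disp + 1)
    # Local compatibility: left[i] may match right[i + d].
    match = [[1 if i + d < len(right) and left[i] == right[i + d] else 0
              for d in disps] for i in range(n)]
    state = [row[:] for row in match]
    for _ in range(iterations):
        new_state = []
        for i in range(n):
            lo, hi = max(0, i - radius), min(n, i + radius + 1)
            # Support for each disparity from the neighborhood of site i.
            support = [sum(state[j][d] for j in range(lo, hi)) for d in disps]
            best = support.index(max(support))  # ties go to the smallest disparity
            winner = [0] * len(support)
            if support[best] > 0:
                winner[best] = 1  # the winner suppresses all other units
            new_state.append(winner)
        if new_state == state:
            break  # the network has reached a fixed point
        state = new_state
    return [row.index(1) if 1 in row else None for row in state]
```

In this noise-free setting, a signal shifted by a constant amount is typically resolved after a single update, since accidental matches at wrong disparities receive little neighborhood support.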

2.  Open  Technical  Questions. 

2.1.  Stochastic  Models. 

We  have  shown  throughout  this  work  the  richness  and  versatility  of  simple  (first 
and second order) MRF models. It is clear, however, that there are classes of physical
structures  whose  behavior  cannot  be  adequately  modeled  by  these  processes  (as  a 
simple  example,  consider  images  formed  by  clusters  of  blobs  of  certain  average  size). 
There  have  been  some  attempts  to  model  these  and  other  "textured"  patterns  via  a 
hierarchy  of  independent  MRF’s:  one  that  represents  the  structure  of  the  image,  in 
terms of regions of uniform texture, and individual models for each textured region.
This  representation,  however,  is  not  very  convenient  for  estimation  purposes.  A 
more  rigorous  approach  has  been  suggested  by  Grenander  (1984),  who  has  proposed 
the  use  of  generalized  Markovian  fields  to  model  complex  patterns;  these  fields 
consist  of  several  layers  of  "generators",  which  in  the  first  layer  correspond  to 
grey  levels,  and  in  the  succeeding  ones,  to  features  of  increased  complexity  (lines, 
corners,  etc.).  It  is  not  clear,  however,  how  to  use  this  approach  to  construct  models 
of textured images, objects of different shapes, etc.


These  considerations  suggest  the  need  for  much  more  research  in  this  area, 
which should include, perhaps, the use of probabilistic models that are not based
on  the  Gibbs  distribution. 

2.2.  Multiple  Scale  Representations. 

It  is  the  current  view  that  the  production  of  high-level  (symbolic)  descriptions 
of  a  scene  should  be  mediated  by  the  construction  of  numerical  descriptions  of 
the  surfaces  involved  at  different  "scales".  The  parameters  that  describe  a  MRF 
play  in  some  sense  the  role  of  scale  parameters  (see  figure  1  of  chapter  1;  section 
5  of  appendix  4.B  and  section  6  of  chapter  5);  this  identification,  however,  is  not 
completely  satisfactory.  A  good  multiscale  representation  should  feature  not  only  a 
progressive  blurring  of  detail,  but  the  aggregation  of  substructures  into  larger  units 
in  a  way  that  is  not  accomplished  by  the  current  algorithms. 

2.3.  Parameter  Estimation. 

Intimately linked with the previous questions is the determination of the optimal
set  of  parameters  of  a  given  model  from  noisy  samples.  The  maximum  likelihood 
method  that  we  have  presented  here  (see  chapter  5)  becomes  computationally 
unfeasible  as  the  complexity  of  the  model  (the  dimensionality  of  the  parameter 
space)  increases;  therefore,  alternative  procedures  need  to  be  derived  (for  instance, 
the use of time-varying algorithms, such as the one presented in section 6 of chapter
5, should be more rigorously investigated).

A  related  (and  more  difficult)  question  is  the  selection  of  the  optimal  model 
from  a  certain  class  given  only  the  noisy  observations.  It  is  possible  that  the  ideas  of 
Rissanen  (1978, 1981,  1983)  about  "minimum  description  length"  schemes,  and  also 
of  Akaike  (1977)  about  generalized  maximum  likelihood  methods  could  be  useful  in 
this  connection,  although  the  high  computational  complexity  of  the  present  problem 
might  limit  the  applicability  of  these  techniques. 

2.4.  Fast  Algorithms. 

The  practical  use  of  the  general  Monte  Carlo  estimation  algorithms  of  chapter 
3 is limited by the relatively large number of iterations needed for the convergence
of these systems. A very important question, then, is how to improve on the
convergence time without sacrificing the power of these methods. The use of
"multigrid" type strategies (Brandt, 1973; Terzopoulos, 1984), which in the present
case may take the form of "block-spin" algorithms, such as the one presented in
chapter 4 (see also White, 1983), should be investigated.

Also  in  this  connection,  it  should  be  interesting  to  find  more  rigorous  justifications 
for  the  performance  of  the  fast  deterministic  schemes  that  we  have  developed,  based 
on  heuristic  considerations,  in  chapters  4  and  5,  to  see  if  it  is  possible  to  find  some 
general  principles  that  may  guide  the  extension  of  these  schemes  to  other,  more 
general  cases. 

2.5.  Analog  Computers. 

It  would  be  interesting  to  actually  construct  prototypes  of  the  hybrid  and  analog 
networks described in chapters 4 and 5, to assess the practicality and performance of
such  schemes.  A  more  intriguing  possibility  is  to  exploit  the  isomorphism  between 
the  estimation  process  of  a  MRF  from  noisy  data,  and  the  equilibrium  behavior 
of  a  ferromagnet  with  a  coupled,  spatially  varying  external  field  (see  chapter  3),  to 
construct  very  fast,  special  purpose  "quantum"  computers  to  perform  the  former 